Health-Informed Policy Gradients for Multi-Agent Reinforcement Learning

Ross E. Allen, MIT Lincoln Laboratory, Lexington, Massachusetts, ross.allen@ll.mit.edu
Jayesh K. Gupta, Computer Science, Stanford University, jkg@cs.stanford.edu
Jaime Pena, MIT Lincoln Laboratory, Lexington, Massachusetts, jdpena@ll.mit.edu
Yutai Zhou, MIT Lincoln Laboratory, Lexington, Massachusetts, yutai.zhou@ll.mit.edu
Javona White Bear, MIT Lincoln Laboratory, Lexington, Massachusetts, jwbear@ll.mit.edu
Mykel J. Kochenderfer, Aeronautics & Astronautics, Stanford University, mykel@stanford.edu

ABSTRACT
This paper proposes a definition of system health in the context of multiple agents optimizing a joint reward function. We use this definition as a credit assignment term in a policy gradient algorithm to distinguish the contributions of individual agents to the global reward. The health-informed credit assignment is then extended to a multi-agent variant of the proximal policy optimization algorithm and demonstrated on particle and multiwalker robot environments that have characteristics such as system health, risk-taking, semi-expendable agents, continuous action spaces, and partial observability. We show significant improvement in learning performance compared to policy gradient methods that do not perform multi-agent credit assignment.[1]

[1] DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering. © 2020 Massachusetts Institute of Technology. Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

KEYWORDS
Multi-Agent Reinforcement Learning (MARL), Continuous Control, Multi-Robot Systems, Risk-Taking

1 INTRODUCTION
Autonomous robotic systems are commonly employed for tasks described as dull, dirty, and dangerous. Multi-robot systems are particularly well suited for dirty and dangerous tasks as they are robust to single-agent degradation and failures. Such degradation can arise from damage to an agent's sensors or actuators, thus limiting the agent's ability to observe the environment and constricting the actions it may take. We will use the term system health to refer to an agent's current state of degradation relative to its nominal capabilities.

While multi-agent systems may be an attractive solution for a range of real-world tasks, developing distributed decision-making and control policies is still challenging. Decision-making in multi-agent systems can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [15]. In general, computing an optimal policy is NEXP-complete [3]. While various solution techniques based on heuristic search [22, 30, 33] and dynamic programming [1, 4, 10] exist, they tend to quickly become intractable as the number of agents, states, and actions increase.
One approach to approximate solutions to such problems is to use deep reinforcement learning (deep-RL) [7, 9, 18, 24, 37]. However, a fundamental challenge is the multi-agent credit assignment problem. When there are multiple agents acting simultaneously toward a shared objective, it is often unclear how to determine which actions of which agents are responsible for the joint reward. There may be strong inter-dependencies between the actions of different agents and long delays between joint actions and their eventual rewards [39].

This work fuses the concept of system health with deep reinforcement learning and provides three contributions to the field of multi-agent decision-making. First, we outline a definition and properties for the concept of system health in the context of Markov decision processes. These definitions provide the framework for a subset of Dec-POMDPs that can be used to analyze multi-agent systems operating in hazardous and adversarial environments. Second, we use the definition of health to formulate a multi-agent credit assignment algorithm that accelerates and improves multi-agent reinforcement learning. Third, we apply the health-informed crediting algorithm within a multi-agent variant of proximal policy optimization (PPO) [29] and demonstrate significant learning improvements compared to existing algorithms in simulated environments involving legged and particle robots with continuous action spaces.

2 RELATED WORK
Applying deep-RL to multi-agent decision-making is an active area of research. Hernandez-Leal, Kartal, and Taylor [12] provide a comprehensive survey of the field. Gupta et al. [9] demonstrated how algorithms such as TRPO, DQN, DDPG, and A3C can be extended to a range of cooperative multi-agent problems. Lowe et al. [18] developed multi-agent deep deterministic policy gradients (MADDPG), which was capable of training in cooperative and competitive environments.

Multi-agent reinforcement learning is challenging due to the problems of non-stationarity and multi-agent credit assignment. The non-stationarity problem arises when a learning agent assumes all other learning agents are part of the environment dynamics. Since the individual agents are continuously changing their policies, the environment dynamics from the perspective of any one agent are continuously changing, thus breaking the Markov property [11]. While Lowe et al. [18] attempt to address the non-stationarity problem, MADDPG is still shown to become ineffective at learning for systems with more than three or four agents.

Gupta et al. [9] partially address the non-stationarity problem through parameter sharing, whereby groups of homogeneous agents use identical copies of parameters for their local policies. Terry et al. [35] provide a theoretical analysis of how the information centralization offered by parameter sharing alleviates some of the non-stationarity problem. However, parameter sharing techniques do not resolve the second fundamental challenge of multi-agent learning: multi-agent credit assignment, i.e. the challenge of identifying which actions from which agent at which time were most responsible for the overall performance (i.e. returns) of the system. Gupta et al. avoid explicit treatment of this problem by focusing on environments where the joint rewards can be decomposed into local rewards. However, in general, such local reward structures are not guaranteed to optimize joint returns [39].
Wolpert and Tumer [39] developed the Wonderful Life Utility (WLU) and Aristocrat Utility (AU), which are forms of "difference rewards". Both WLU and AU attempt to assign credit to individual agents' actions by taking the difference of the utility received versus the utility that would have been received had a different action been taken by the agent. The comparison between actual returns and hypothetical returns is sometimes referred to as counterfactual learning [7]. Predating most of the advancements in deep reinforcement learning, Wolpert and Tumer's work was restricted to small decision problems that could be handled in a tabular fashion [36, 39].

Foerster et al. [7] formulated an aristocrat-like crediting method that was able to leverage a deep neural network state-action value function (Q-value) within a policy gradient algorithm, referred to as counterfactual multi-agent (COMA) policy gradients. Using deep neural networks, they enabled crediting in large or continuous state spaces. Subsequent work on multi-agent credit assignment led to the QMIX and MAVEN algorithms, which factorized the Q-value across agents and demonstrated improved performance over COMA [19, 25]. However, COMA, QMIX, and MAVEN required enumeration over all actions and were thus restricted to problems with discrete action spaces.

Others have posed multi-agent, health-aware decision problems similar to the one we give in Section 3 [21, 23]. These prior works use a planning-based approach that assumes detailed, a priori knowledge of the underlying Dec-POMDP's dynamics. This is fundamentally different from the learning-based approach we present in Section 4.

3 PROBLEM STATEMENT
The problems presented in this work can be modeled as decentralized partially observable Markov decision processes (Dec-POMDPs), which are defined by the tuple $(\mathcal{I}, \mathcal{S}, \mathcal{A}_i, \mathcal{Z}_i, T, R)$. The set $\mathcal{I}$ represents a finite set of $n$ agents. The set $\mathcal{S}$ is the joint state space of all agents (finite or infinite). Assuming states are described in vector form, let $\mathbf{s} \in \mathcal{S} \subseteq \mathbb{R}^m$ be a specific state of the system. The set $\mathcal{A}_i(\mathbf{s})$ is the action space of the $i$th agent in joint state $\mathbf{s}$. The vector $\mathbf{u}_t = (a_{1,t}, a_{2,t}, \ldots, a_{n,t})$ represents a joint action at time $t$, where $a_{i,t} \in \mathcal{A}_i(\mathbf{s}_t)$. The set $\mathcal{Z}_i(\mathbf{s})$ is the set of observations for the $i$th agent in joint state $\mathbf{s}$. The vector $\mathbf{o}_t = (z_{1,t}, z_{2,t}, \ldots, z_{n,t})$ represents a joint observation at time $t$, where $z_{i,t} \in \mathcal{Z}_i(\mathbf{s}_t)$. The transition function $T(\mathbf{s}' \mid \mathbf{s}, \mathbf{u})$ is the probability density associated with arriving in state $\mathbf{s}'$ given that the joint action $\mathbf{u}$ was taken in state $\mathbf{s}$. The reward function $R(\mathbf{s}, \mathbf{u})$ gives the immediate reward for taking the joint action $\mathbf{u}$ while in state $\mathbf{s}$. The vector $\tau_{i,t} = (z_{i,1}, a_{i,1}, z_{i,2}, a_{i,2}, \ldots, z_{i,t})$ represents the observation-action history for agent $i$ up to time $t$. Using notation similar to Foerster et al. [7], we represent group-wide joint variables in bold and joint quantities that exclude a particular agent with the term $\neg i$.

In order to solve the Dec-POMDP we seek a joint policy $\boldsymbol{\pi}_{\boldsymbol{\theta}}(\mathbf{u} \mid \mathbf{s})$, composed of a set of local policies $\pi_{\theta_i}(a_i \mid \tau_i)$, that maximizes the discounted joint return $G_t = \sum_{l=0}^{t_f - t} \gamma^l r_{t+l}$, where $\gamma$ is the discount factor, $r$ is the empirical joint reward, and $t_f$ is the final time step in an episode or receding horizon.
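As a minimal concrete illustration of the return definition above (a sketch using our own variable and function names, not code from the paper), the discounted joint return $G_t$ can be computed from an episode's empirical joint rewards as follows:

```python
import numpy as np

def discounted_joint_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{l=0}^{t_f - t} gamma^l * r_{t+l} for every time step t.

    rewards: sequence of empirical joint rewards r_1, ..., r_{t_f} for one episode.
    Returns an array where entry t is the discounted return-to-go from time t.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each step reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: joint rewards shared by all agents over a 4-step episode.
print(discounted_joint_returns([1.0, 0.0, 0.0, 2.0], gamma=0.9))
```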
The local policies, parameterized by $\theta_i$, map an agent's observation-action history to its next action at each time step. For a group of $n$ agents that follow independent stochastic policies[2] [7], we have

$$\boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i(a_i \mid \tau_i, \theta_i). \quad (1)$$

[2] Independent policies implies that agent $i$'s action at time $t$ is not conditioned on the actions of any other agent at time $t$. Note that, in the case of parameter sharing where all agents have identical copies of parameters, $\theta_i = \theta$, the local policies can still be independent.

Using the common definition of the state value function of policy $\boldsymbol{\pi}$, we define $V_{\boldsymbol{\pi}}(\mathbf{s}_t) = \mathbb{E}_{\mathbf{s}_{t+1:t_f},\, \mathbf{u}_{t:t_f}}[G_t]$ [28]. Similarly, we define the joint state-action value function as $Q_{\boldsymbol{\pi}}(\mathbf{s}_t, \mathbf{u}_t) = \mathbb{E}_{\mathbf{s}_{t+1:t_f},\, \mathbf{u}_{t+1:t_f}}[G_t]$.

3.1 System Health
Here we introduce a concept we refer to simply as system health, though there are alternative definitions used in the field of prognostic decision making (PDM) [2, 26]. If we represent the current state of the system with the vector $\mathbf{s}$, then the system health, $\mathbf{h} \in \mathbb{R}^n$, constitutes a subvector of $\mathbf{s}$. Without loss of generality, we can define the state vector as $\mathbf{s} = (\mathbf{h}, \mathbf{p})$, where $\mathbf{p}$ is the non-health component of the state. Each element of the health vector corresponds to the health of an individual agent and lies in the interval $[0, 1]$, where 1 represents full health and 0 represents a fully degraded agent. We define the following properties associated with reduction of system health:

Property 1 (Non-recoverable minimum health). Let $\mathbf{s}_{h_i = c}$ represent any state vector where the health of agent $i$ is given as $c \in [0, 1]$. We define the non-increasing nature of the health once minimum health has been reached (i.e. agent death) as follows:

$$T\left(\mathbf{s}'_{h_i = \beta} \mid \mathbf{s}_{h_i = 0}, \mathbf{u}\right) = 0 \quad \text{for } \beta > 0. \quad (2)$$

Property 2 (Constriction of the reachable set in state space). Define the reachable set of joint state $\mathbf{s}$ and the constriction of the reachable set as follows:

$$\mathcal{R}(\mathbf{s}) = \{\mathbf{s}' \in \mathcal{S} \mid \exists\, \mathbf{u} : T(\mathbf{s}' \mid \mathbf{s}, \mathbf{u}) > 0\}, \qquad \mathcal{R}(\mathbf{s}_{h_i = \beta}) \subseteq \mathcal{R}(\mathbf{s}_{h_i = \alpha}) \quad \text{for } \alpha > \beta. \quad (3)$$

Property 3 (Constriction of available actions in action space). Let $\mathcal{A}_i(\mathbf{s}) \subseteq \mathcal{A}_i$ represent the available action set for agent $i$ when the system is in state $\mathbf{s}$. The constriction of the available action set can then be described as

$$\mathcal{A}_i(\mathbf{s}_{h_i = \beta}) \subseteq \mathcal{A}_i(\mathbf{s}_{h_i = \alpha}) \quad \text{for } \alpha > \beta. \quad (4)$$

Property 4 (Constriction of the observable set in observation space). Let $\mathcal{Z}_i(\mathbf{s}) \subseteq \mathcal{Z}_i$ represent the set of possible observations for agent $i$ when the system is in state $\mathbf{s}$. The constriction of the observable set can then be described as

$$\mathcal{Z}_i(\mathbf{s}_{h_i = \beta}) \subseteq \mathcal{Z}_i(\mathbf{s}_{h_i = \alpha}) \quad \text{for } \alpha > \beta. \quad (5)$$

To provide real-world intuition about system health, consider multi-agent systems composed of physical robots. In such a case, system health can be thought of as the state of damage or degradation of a physical robot. A health of zero implies a robot has completely 'crashed' or been otherwise terminated, and Property 1 asserts that the robot will not become operational again. Properties 2 and 3 describe the effect of damaging a robot's actuators, thus partially or completely 'crippling' its motion. Property 4 describes the effect of damaging a robot's sensors, thus limiting the observations it can make of the world.

In this paper, we pay particular attention to the case of binary health states, i.e. $h_i \in \{0, 1\}$, where zero health represents a complete constriction of an agent's action space.
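For the binary-health case, a minimal sketch of how Properties 1 and 3 might appear in an environment's step logic is given below (illustrative only; the function and variable names are ours, not the paper's):

```python
import numpy as np

def step_health(health, damage_events):
    """Property 1: health is non-increasing and zero health is absorbing."""
    new_health = np.where(damage_events, 0.0, health)  # damaged agents drop to 0
    return np.minimum(health, new_health)               # health can never increase

def mask_actions(health, chosen_actions, null_action):
    """Property 3 (binary case): a zero-health agent's action set collapses to a
    single null action, regardless of what its policy selected."""
    return [a if h > 0 else null_action for h, a in zip(health, chosen_actions)]

# Example: agent 1 has been terminated, so its chosen action is overridden.
health = np.array([1.0, 0.0, 1.0])
print(mask_actions(health, ["left", "right", "up"], null_action="noop"))
```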
In this binary-health case, agents are either fully functional or completely non-operational, where non-operational agents are incapable of selecting actions to interact with the environment. This is a natural formulation for many multi-agent problems, such as multi-agent computer games, where agents maintain their full functionality up until the moment they become non-operational.

4 APPROACH
This section proposes a counterfactual learning algorithm that seeks to address the multi-agent credit assignment problem for systems that embody the characteristics of system health. We choose to restrict our scope to policy gradient methods because of their scalability to large and continuous state and action spaces [6]. We develop a health-informed policy gradient in Section 4.1 and use it to propose a new multi-agent PPO variant in Section 4.2.

In general, each agent may be learning its own individual policy $\pi_{\theta_i}$; however, this can lead to a non-stationary environment from the perspective of any one agent and confound the learning process [11]. Instead, for this paper, we assume that agents execute identical copies of the same policy $\pi_\theta$ in a decentralized fashion, referred to as parameter sharing [9, 13, 35]. We adopt a centralized-learning, decentralized-execution architecture whereby training data can be centralized between training episodes even if no such centralization of information is possible during execution, a common approach in the multi-agent RL literature [7, 9, 18].

4.1 Health-Informed Multi-Agent Policy Gradients
To develop a policy gradient approach for health-based multi-agent systems, we begin with the policy gradient theorem [31]:

$$\nabla J(\boldsymbol{\theta}) \propto \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \nabla \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}). \quad (6)$$

Equation 6 gives an analytical expression for the gradient of the objective function with respect to the policy parameters $\boldsymbol{\theta}$, where $q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})$ is the true state-action value and $\mu(\mathbf{s})$ refers to the ergodic state distribution. To develop a practical algorithm, we need a method for sampling this analytical expression whose expected value is equal or approximately equal to Equation 6. Eq. (7) gives a multi-agent form for sampling the policy gradient considering local policies $\pi_{\theta_i}$ that map an individual agent's local trajectory $\tau_{i,t}$ to actions $a_{i,t}$:

$$g_\Psi = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{i=1}^{n} \Psi_{i,t}\, \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i) \right], \quad (7)$$

where the term $\Psi$ can be formulated to affect the bias and variance of the learning process; see Appendix A for a derivation [7, 28]. Typically $\Psi$ takes the form of the return or discounted return $G_t$ (i.e. the REINFORCE algorithm [38]); the return baselined on the state value function $V_{\boldsymbol{\pi}}(\mathbf{s}_t)$ (i.e. REINFORCE with baseline [32]); the state-action value function $Q_{\boldsymbol{\pi}}(\mathbf{s}_t, \mathbf{u}_t)$; the advantage function $A_{\boldsymbol{\pi}}(\mathbf{s}_t, \mathbf{u}_t) = Q_{\boldsymbol{\pi}}(\mathbf{s}_t, \mathbf{u}_t) - V_{\boldsymbol{\pi}}(\mathbf{s}_t)$; or the generalized advantage function $A^{\mathrm{GAE}}_{\boldsymbol{\pi}}$ [28, 29].

If all agents have the same policy parameters, receive identical rewards throughout an episode, and employ the same $\Psi$ function, then Equation 7 renders the same gradient at each time step for all agents' actions. This uniformity in policy gradients is problematic because it results in all actions at a given time step being promoted equally during the next training cycle; this is precisely the multi-agent credit assignment problem. To overcome this problem, we use the term $\Psi_{i,t}$ to distinguish gradients between different agents' actions at the same time step.
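As an illustration of Eq. (7), the sampled multi-agent gradient with parameter sharing reduces to a weighted sum of per-agent log-probability terms, each carrying its own credit $\Psi_{i,t}$. The sketch below is our own toy construction, not the authors' code; for simplicity the shared Gaussian policy conditions on the current local observation rather than the full history $\tau_{i,t}$:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 8, 2, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = torch.zeros(act_dim, requires_grad=True)  # shared Gaussian policy

def policy_gradient_loss(local_obs, actions, psi):
    """Surrogate loss whose gradient matches Eq. (7):
    -(1/T) * sum_{t,i} Psi_{i,t} * log pi_theta(a_{i,t} | tau_{i,t})."""
    mean = policy(local_obs)                               # [T, n_agents, act_dim]
    dist = torch.distributions.Normal(mean, log_std.exp())
    logp = dist.log_prob(actions).sum(-1)                  # [T, n_agents]
    return -(psi.detach() * logp).mean()

# Dummy batch: T time steps of local observations, executed actions, and credits Psi.
T = 16
obs = torch.randn(T, n_agents, obs_dim)
acts = torch.randn(T, n_agents, act_dim)
psi = torch.randn(T, n_agents)          # e.g. health-informed credits from Eq. (8)
policy_gradient_loss(obs, acts, psi).backward()
```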
One option is to set $\Psi_{i,t} = Q_\pi(\tau_{i,t}, a_{i,t})$, which defines an observation-action value function referred to as a local critic. Since all agents are expected to receive distinct observations at each time step, $Q_\pi(\tau_{i,t}, a_{i,t})$ is expected to be distinct for each agent at a given time step. However, this approach relies on making value approximations based on limited information, which can significantly impact learning, as we demonstrate in Section 5. If we instead leverage our prior assumption of centralized learning [7, 9, 18], then we can make direct use of the central value functions $V_{\boldsymbol{\pi}}(\mathbf{s}_t)$ and $Q_{\boldsymbol{\pi}}(\mathbf{s}_t, \mathbf{u}_t)$; however, this still does not resolve the multi-agent credit assignment problem.

We can use the concept of system health in an attempt to address the credit assignment problem. Our technique stems from the concepts of counterfactual baselines [7] and difference rewards [20], and is further motivated by the Wonderful Life Utility (WLU) [39]. The idea is that credit is assigned to an agent at a given time step by comparing the true joint return from time $t$ with the expected return from a hypothetical or "counterfactual" scenario in which agent $i$ had been terminated at time $t$. We propose the following minimum-health baseline for multi-agent credit assignment:

$$\Psi_{i,t} = h_{i,t}\left(G_t - V_{\boldsymbol{\pi}}\left(\mathbf{s}^{\neg h_{i,t}}_t, h_{i,\min}\right)\right) = h_{i,t} G_t - b\left(\mathbf{s}^{\neg i}\right), \quad (8)$$

where $h_{i,t}$ is the health of agent $i$ at time $t$ and $\mathbf{s}^{\neg h_{i,t}}_t$ represents the true joint state of the system at time $t$, except that the health of agent $i$ is replaced with minimum health (typically 0).

To understand the multiplication by the true health $h_{i,t}$ in Equation 8, consider the implications of Properties 2 and 3 in Section 3.1. If a reduction in system health constricts the available actions and reachable states at a given state, and these constrictions are not encoded within the policy $\pi_i$, then an inconsistency arises between chosen and executed actions that can affect training [8]. This would occur if the policy selects an action from a health state that the physical agent is incapable of executing.

To overcome this inconsistency, we modify the policy gradient such that the contribution from agent $i$ at time $t$ is multiplied by the true health of agent $i$ at time $t$, i.e. $h_{i,t} \in [0, 1]$. The motivation for this modification is that, as an agent's health degrades and the available action set is constricted, it becomes less likely that the action selected by the policy aligns with the action executed by the agent. By attenuating the policy gradient by the health of the agent, policy learning occurs more slowly on data generated in low-health states, thus reducing the effect of mismatched chosen and executed actions. Section 5 investigates the case where the health state is binary for each agent and a zero-health state shrinks the action set until only a zero-vector action is executable by the agent.

Lemma 1 states that the gradient of the minimum-health baseline term in Eq. (8) is zero, and thus it does not introduce bias into the gradient estimate in Eq. (7). Lemma 2 states that health-informed policy gradients converge to a locally optimal policy for systems with binary health states.
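Before stating the lemmas formally, here is a minimal sketch of how the minimum-health baseline in Eq. (8) could be computed for a batch of experience. This is our own illustrative code; `counterfactual_values` stands in for any estimate of $V_{\boldsymbol{\pi}}$ evaluated at the minimum-health counterfactual state (in practice, the centralized critic introduced in Section 4.2):

```python
import numpy as np

def min_health_credit(health, returns, counterfactual_values):
    """Eq. (8): Psi_{i,t} = h_{i,t} * (G_t - V(s_t with agent i's health set to h_min)).

    health:                array [T, n] of true agent healths h_{i,t}
    returns:               array [T] of empirical joint returns G_t
    counterfactual_values: array [T, n]; entry (t, i) is a value estimate of the
                           joint state at time t with agent i's health zeroed
    """
    return health * (returns[:, None] - counterfactual_values)

# Example with 3 time steps and 2 agents; agent 1 is dead (h = 0) at t = 2,
# so its credit is attenuated to zero there.
h = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
G = np.array([5.0, 3.0, 1.0])
V_cf = np.array([[4.0, 4.5], [2.0, 2.5], [0.5, 0.8]])
print(min_health_credit(h, G, V_cf))
```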
Lemma 1. Let the health-informed gradient in Eq. (7) be written as $g_\Psi = g_h - g_b$; then

$$g_b = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i) \right] = 0.$$

Proof. Applying Eq. (8) to Eq. (6), we can write

$$g_\Psi = \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \sum_{i=1}^{n} \left( h_i\, q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u}) - b\left(\mathbf{s}^{\neg i}\right) \right) \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) = g_h - g_b.$$

Separating out the baseline's gradient contribution and applying Eq. (1),

$$\begin{aligned}
g_b &= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \sum_{a_i} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \nabla_{\theta_i} \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \nabla_{\boldsymbol{\theta}} \sum_{a_i} \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \nabla_{\boldsymbol{\theta}}\, 1 = 0. \qquad \square
\end{aligned}$$

Lemma 2. Given a system with binary health states $h_i \in \{0, 1\}$, where $h_i = 0$ constricts the action space to a single element, $|\mathcal{A}_i(\mathbf{s}_{h_i = 0})| = 1$, following the gradient $g_\Psi$ in Eq. (7) at each iteration $k$ gives

$$\lim_{k \to \infty} \|\nabla_{\boldsymbol{\theta}} J\| = 0. \quad (9)$$

Proof. From Lemma 1 and Eq. (7), we have

$$g_\Psi = g_h = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{i=1}^{n} h_{i,t} G_t\, \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i) \right].$$

With binary health states, we can separate the $n$ agents into groups based on their health at time $t$ such that $\mathcal{I}_1 = \{i : h_{i,t} = 1\}$ and $\mathcal{I}_0 = \{i : h_{i,t} = 0\}$. For the group of agents in $\mathcal{I}_0$, the action space is constricted such that a single action is deterministically selected, regardless of $\theta_i$. Therefore, $\nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i) = 0$ for $i \in \mathcal{I}_0$. As a result, we can say for all $i$

$$h_{i,t}\, \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i) = \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i).$$

Applying this result and Eq. (1) gives

$$g_\Psi = \mathbb{E}_{\boldsymbol{\pi}}\left[ G_t \sum_{i=1}^{n} \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i) \right] = \mathbb{E}_{\boldsymbol{\pi}}\left[ G_t\, \nabla_{\boldsymbol{\theta}} \log \boldsymbol{\pi}_{\boldsymbol{\theta}}(\mathbf{u}_t \mid \mathbf{s}_t, \boldsymbol{\theta}) \right],$$

which is the gradient of the single-agent REINFORCE algorithm, which has proven convergence properties for undiscounted returns [32, 38]. □

Equation 8 provides a health-informed counterfactual baseline on the state value function that is completely agnostic to the action space, a property not seen in prior work [7, 19, 20, 25, 39]. This makes our approach well suited for use in policy optimization techniques for large or continuous action spaces, such as TRPO and PPO [27, 29].

Other recent work on multi-agent credit assignment draws inspiration from WLU and Aristocrat Utility (AU). Nguyen et al. [20] offer a modern form of WLU that requires maintaining explicit counts of discrete actions and state visitations and is not well posed for continuous domains and deep-RL. Modern implementations of Aristocrat Utility such as COMA [7] require enumeration over all possible actions or computationally expensive Monte Carlo analysis at each time step. The health-informed baseline suffers no such limitations.

4.2 Health-Informed Multi-Agent Proximal Policy Optimization
While the health-informed credit assignment technique described in Section 4.1 is applicable to any reinforcement learning algorithm that uses value functions, we choose to demonstrate our technique within a multi-agent variant of proximal policy optimization (PPO) [29]. PPO is chosen because it has been shown to work well with continuous action spaces [6] as well as multi-agent environments [24].

Our health-informed counterfactual baseline is evaluated using a centralized critic, $V_w(\mathbf{s}_t)$, which we model with a deep neural network parameterized by weights $w$.
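To make the counterfactual baseline concrete, the sketch below (our own illustration, assuming a state layout in which each agent's health occupies one known index of the joint state vector) shows a small centralized critic and the construction of the minimum-health counterfactual state $\mathbf{s}^{\neg h_i}_t$:

```python
import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Centralized value network V_w(s) over the joint state (a simple MLP sketch)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def counterfactual_value(critic, joint_state, health_idx, agent, h_min=0.0):
    """Evaluate V_w of the joint state with agent i's health replaced by h_min."""
    s_cf = joint_state.clone()
    s_cf[..., health_idx[agent]] = h_min  # zero out agent i's health component
    return critic(s_cf)

# Example: 4 agents whose healths sit in the first 4 entries of a 20-dim joint state.
critic = CentralCritic(state_dim=20)
s_t = torch.randn(3, 20)                     # batch of 3 joint states
b = counterfactual_value(critic, s_t, health_idx=list(range(4)), agent=1)
print(b.shape)                               # torch.Size([3])
```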
As with the original PPO paper [29], the centralized critic is trained using generalized advantage estimation [28], where the centralized value targets are defined as

$$V^{\mathrm{targ}}_t = A^{\mathrm{GAE}}_t + V_{w_{\mathrm{old}}}(\mathbf{s}_t), \quad (10)$$

and the centralized value loss function is

$$L^{\mathrm{VF}}(w) = \left( V_w(\mathbf{s}_t) - V^{\mathrm{targ}}_t \right)^2. \quad (11)$$

Now, replacing the returns $G_t$ in Eq. (8) with the more general value targets, our counterfactual baseline becomes

$$\Psi_{i,t} = h_{i,t} \left( V^{\mathrm{targ}}_t - V_{w_{\mathrm{old}}}\left(\mathbf{s}^{\neg i}_t\right) \right). \quad (12)$$

We apply the health-informed counterfactual baseline in Eq. (12) to PPO's surrogate objective function and formulate the clipped surrogate objective function

$$L(\theta) = \mathbb{E}_t\left[ \frac{\pi_\theta(a_{i,t} \mid \tau_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid \tau_{i,t})}\, \Psi_{i,t} \right] = \mathbb{E}_t\left[ \rho_{i,t}(\theta)\, \Psi_{i,t} \right], \quad (13)$$

$$L^{\mathrm{CLIP+S}}(\theta) = \mathbb{E}_t\left[ c\, S_{\pi_\theta} + \min\left( \rho_{i,t}(\theta)\, \Psi_{i,t},\ \mathrm{clip}\left(\rho_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \Psi_{i,t} \right) \right]. \quad (14)$$

As in the original PPO paper, we also need to train the value function network and augment the objective with an entropy bonus, $S_{\pi_\theta}$ weighted by hyperparameter $c$, to encourage exploration [29].

Algorithm 1 gives the training process for a group of $n$ cooperative agents in a centralized-learning, decentralized-execution framework using the health-informed multi-agent proximal policy optimization (MAPPO) algorithm and parameter sharing.

Algorithm 1: Health-Informed Multi-Agent PPO
    Initialize $\theta$ and $w$
    for iteration = 1, 2, ... do
        Run local policies $\pi_\theta$ on $n$ agents for $t_f$ timesteps
        Compute value targets $V^{\mathrm{targ}}_t$ for all $t \in \{1, \ldots, t_f\}$
        Compute $\Psi_{i,t}$ with Eq. (12) for all $i$ and all $t$
        Compute $\theta'$ with the PPO update in Eq. (14), using $K$ epochs and minibatch size $M \leq n t_f$
        Compute $w'$ with Eq. (11)
        Update $\theta \leftarrow \theta'$, $w \leftarrow w'$
    end for
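The following sketch (again ours, not the released implementation; it assumes precomputed per-agent credits from Eq. (12) and stored old log-probabilities) shows how the clipped surrogate in Eq. (14) might be evaluated for one minibatch:

```python
import torch

def health_informed_ppo_loss(logp_new, logp_old, psi, entropy,
                             clip_eps=0.2, ent_coef=0.01):
    """Clipped surrogate of Eq. (14) with entropy bonus.

    logp_new: log pi_theta(a_{i,t} | tau_{i,t}) under the current policy
    logp_old: the same quantity under the policy that gathered the data
    psi:      health-informed credits Psi_{i,t} from Eq. (12)
    entropy:  per-sample policy entropy S_{pi_theta}
    """
    rho = torch.exp(logp_new - logp_old)                       # probability ratio
    unclipped = rho * psi
    clipped = torch.clamp(rho, 1.0 - clip_eps, 1.0 + clip_eps) * psi
    # Negate because optimizers minimize while Eq. (14) is maximized.
    return -(torch.min(unclipped, clipped) + ent_coef * entropy).mean()

# Example minibatch of n * T flattened agent-time samples.
lp_new = torch.randn(128, requires_grad=True)
loss = health_informed_ppo_loss(lp_new, lp_new.detach() + 0.1 * torch.randn(128),
                                psi=torch.randn(128), entropy=torch.ones(128))
loss.backward()
```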
5 EXPERIMENTS
This section presents experiments that demonstrate health-informed credit assignment in multi-agent learning. Two separate implementations of health-informed proximal policy optimization were developed. The first implementation was built with a TensorFlow framework within a forked version of OpenAI's Multi-Agent Particle Environment (MPE) library that is extended to incorporate the concepts of system health and risk-taking [17, 18]. To take advantage of recently developed multi-agent RL toolsets, a second implementation was built with PyTorch within RLlib [16] and trained in the PettingZoo multiwalker environment [9, 34].

Since the original MPE environments were dedicated to small groups of agents and did not incorporate the concept of health, new scenarios were developed.[3] The two MPE scenarios, titled hazardous navigation and hazardous communication network, and the PettingZoo multiwalker environment are described later in this section.

[3] See Appendix B for a comparison between MADDPG and our proposed variants of MAPPO in an environment taken directly from the original MADDPG paper [18].

The experiments compare multi-agent deep deterministic policy gradients (MADDPG) and QMIX (for the RLlib implementation only) [25] with three multi-agent variants of proximal policy optimization (MAPPO) referred to as local critic, central critic, and min-health crediting. The local critic MAPPO uses an advantage function based on local observation value estimates, $\Psi_{i,t} = A^{\mathrm{GAE}}_t(\tau_{i,t_f})$. The central critic MAPPO uses an advantage function based on joint state value estimates enabled by the centralized learning assumption: $\Psi_{i,t} = A^{\mathrm{GAE}}_t(\mathbf{s}_t)$. The minimum-health crediting MAPPO uses the health-informed counterfactual baseline in Equation 12.

Hazardous navigation. This environment is closely related to Lowe's "cooperative navigation" environment, where agents must cooperate to reach a set of landmarks [18]. Reward is based on the distance from each landmark to the nearest agent, thus all agents receive the same reward and each landmark should be 'covered' by a separate agent in order to maximize reward. Our variation of this problem incorporates a hazardous landmark that can probabilistically cause an agent to be terminated (i.e. spontaneously transition to a zero-health state) if the agent is within a threshold distance of the hazard. The landmark that poses a hazard is not known until at least one agent crosses its threshold distance. To connect this scenario to a real-world problem, consider the use of uninhabited aerial vehicles (UAVs) to monitor wildfires. Each spot fire must be continuously monitored, and one spot fire poses significant risk to any UAV within its proximity. Figure 1a is a snapshot of the hazardous navigation environment.

Figure 1: The 8-agent hazardous navigation (1a), 16-agent hazardous communication (1b), and 5-agent multiwalker (1c) environments. (1a) Black dots represent landmarks to be covered by agents. The red dot is the hazardous landmark that has been 'revealed' by an agent in its vicinity. (1b) The larger black dots represent the terminals to be connected. The smaller dots represent agents. Links are formed between agents that are within each other's communication radius. The red dot is the environmental hazard. Tiny red crosses represent agents that have been terminated due to proximity to the hazard. (1c) Walkers balance the package on their heads and walk together. Fallen agents are given zero health and rendered immobile.

Hazardous communication network. This environment consists of two fixed landmark terminals and a group of mobile agents that can serve as communication relays over short distances. The objective is for the agents to cooperatively arrange themselves into an uninterrupted chain linking the two terminals. All agents receive the same reward for every time step in which the terminals are connected, and zero reward when the link is broken. For each episode, an environmental hazard is randomly placed between the terminals, which causes agents in its vicinity to be terminated with some probability $p_{\mathrm{fail}}$. Figure 1b shows a snapshot of the hazardous communication network scenario.

Multiwalker. This environment, originally presented in Gupta et al. [9] and adapted into the PettingZoo library [34], consists of $N$ robots that must collaboratively carry a package. The group of robots is rewarded based on the distance the package has moved. Each agent observes the relative pose of neighboring walkers as well as the pose of the package. As control input, each agent selects the joint torques to apply to its leg joints. The agents must learn to walk as well as carry the package in order to achieve high reward. Unlike the original implementation of multiwalker, which used local reward shaping to avoid the multi-agent credit assignment problem [9], this version of the problem enforces a joint reward signal. If an agent falls to the ground, its health is set to zero and it can take no further action. Figure 1c gives a snapshot of the multiwalker environment.
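A minimal sketch of the hazard mechanic shared by the two particle scenarios described above (illustrative only; the threshold and probability names are ours) is shown below:

```python
import numpy as np

def apply_hazard(positions, health, hazard_pos, threshold=0.25, p_fail=0.5,
                 rng=np.random.default_rng()):
    """Agents within `threshold` of the hazard are terminated with probability p_fail:
    their health drops to zero and, per Property 1, never recovers."""
    dists = np.linalg.norm(positions - hazard_pos, axis=-1)
    at_risk = dists < threshold
    terminated = at_risk & (rng.random(len(health)) < p_fail)
    return np.where(terminated, 0.0, health)

# Example: agent 0 sits on top of the hazard and may be terminated this step.
pos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
print(apply_hazard(pos, health=np.ones(3), hazard_pos=np.array([0.0, 0.05])))
```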
All three environments are characterized by partially observed state spaces, continuous action spaces, joint rewards, and agent attrition, making them particularly challenging for multi-agent reinforcement learning.

5.1 Results
Figure 2 summarizes the results of our experiments. A few trends that are common to all campaigns immediately emerge. Most notably, we see that MAPPO with health-informed crediting tends to outperform all other multi-agent RL algorithms for each of the environments and group sizes tested. Furthermore, by comparing 2a to 2b and 2c to 2d, we see that the performance gap between health-informed crediting and other algorithms tends to increase as the number of agents increases. We expect such a trend because, as the number of agents in the environment increases, the multi-agent credit assignment problem becomes more pronounced, and therefore crediting algorithms such as Equation 8 should show increasing benefits. In cases where health-informed crediting only slightly outperforms a non-crediting approach (such as 2a, 2b, and 2f), we see that the health-informed baseline has the added benefit of reducing variance between trials within a training campaign. This can be seen by comparing the relative sizes of the blue shaded regions with the orange shaded regions.

In general, centralized critics, which include MAPPO: central critic and MAPPO: min-health crediting, tend to outperform local critics and always outperform MADDPG and QMIX for the environments tested. MADDPG's underperformance is likely due to the fact that these environments consist of multi-agent group sizes considerably larger than those developed in the original MADDPG work. This would explain why MADDPG shows its best performance in Figure 2a, which is the scenario most closely aligned with the original "cooperative navigation" environment [18].

The poor performance of QMIX is almost certainly due to the fact that the algorithm is fundamentally designed for discrete action spaces, whereas our environments all consist of continuous action spaces. In order to run QMIX, the action space was discretized into eight action bins per joint, but this did not give sufficient resolution to enable QMIX to learn a suitable policy. This highlights the advantage of credit assignment with a policy optimization algorithm (i.e. PPO) in contrast to value-based methods like QMIX and COMA.

[Figure 2 panels (a)-(f) plot total rewards per episode versus training timesteps (up to 2.5 million) for the 4-agent navigation, 8-agent navigation, 10-agent communication, 16-agent communication, and 5-agent multiwalker scenarios.]

Figure 2: Learning curves for the hazardous navigation (2a, 2b), hazardous communication (2c, 2d), and multiwalker (2e, 2f) environments.
Each campaign denes an environment and the number of agents present in that environment and produces a learning curve for MADDPG, three variants of MAPPO, and QMIX (note: QMIX is only implemented in multiwalker envi- ronment where RLlib was used as the training engine). A single learning curve is an average over four independent training experiments with the shaded region representing the minimum and maximum bounds of the four training experiments. MAD- DPG is shown in red crosses. MAPPO with a local critic is shown in green triangles. MAPPO with a non-crediting central critic is shown in orange X’s. MAPPO with a central critic and the health-informe d, counterfactual baseline given in Equation 8 (referred to as min-health crediting ) is shown in blue circles. QMIX is shown in purple diamonds. The algorithms and envi- ronments used to generate these results are publicly available at: https://github.com/rallen10/ergo_particle_gym A local critic tends to learn more quickly , outperforming other algorithms in the short term, but then plateaus and is overtaken by central critic approaches. An interesting e xception to this trend is the 5- Agent multiwalker experiment (2f) where local-critic MAPPO appears to perform on par with the health-informe d MAPPO . The local critic MAPPO aggregate performance is heavily inuence d by a trial that seemed to discover an exploit in the environment that produced high rewards with little coordination between the walkers; the agents learned a shaking/jerking motion—instead of a walking motion—that slid the package forward like an object moving on a shaking table. The local critic algorithm was not able to reproduce the behavior in other training, thus the wide variance in experiments. Experiments wer e run on Intel Xeon E5-2687W CP Us in a 32- core Linux desktop. The particle environment experiment were run for 50,000 episodes with each episode consisting of 50 time steps and training batches comp osed of 256 episodes. Training batches were broken into 8 minibatches and run over 8 ep ochs. Multiwalker trained for 5 million timesteps, with training batches of 16384 timesteps, minibatches 4096 timesteps, and 32 epochs. For the multi-agent PPO e xperiments an entropy coecient of 0.01 was used for particle envir onments to ensure sucient exploration [ 29 ] and 0.0 in multiwalker to stabilize training. For all experiments represented in Figure 2 the policy netw ork was composed of a multilayer perceptron (MLP) with 2 fully con- nected hidden layers, each of which b eing 64 units wide, and a hy- berbolic tangent activation function. For experiments that utilized a local critic the value function network matched the architecture of the policy network. For experiments that utilized a centralized critic the value function network had a distinct architecture that was developed empirically . Such centralized critic networks con- sisted of a 8-layer by 64-unit, fully connected MLP that used an exponential linear unit (ELU) activation function [ 5 ]. W e obser ve that the ELU activation function tended to outperform the rectied linear unit (ReLU) and hyperbolic tangent activation functions for central critic learning. For the particle environments an actor learning rate of 1 . 0 × 10 − 3 and a central critic learning rate of 5 . 0 × 10 − 3 was used with the Adam optimizer [ 14 ]. For the multiwalker environment run with RLlib, a learning rate of 3 × 10 − 4 was used. 
6 CONCLUSIONS
In this paper we have proposed a definition for system health and shown how it can be used in policy gradient methods to improve multi-agent reinforcement learning in a certain class of Dec-POMDPs. The techniques presented here are well suited for solving continuous-control multi-robot coordination problems in hazardous environments such as search and rescue (e.g. exploring burning or collapsing buildings), disaster relief (e.g. mapping wildfires or toxic chemical leaks with groups of UAVs), and coordinated load carrying. These techniques are also well suited for reinforcement learning in multi-agent adversarial game environments that exhibit agent attrition, such as StarCraft II and DOTA 2 [24, 37].

This work raises several questions that merit future investigation. The logical next step would be to explore whether a similar form of counterfactual reasoning can help address multi-agent credit assignment in continuous-control environments that are not characterized by system health. This could perhaps be achieved by generating training data in environments with fewer agents than the target environment, but it is uncertain what side effects this type of "off-environment" experience would have on training. Additionally, the combination of the action space constriction in Property 3 and the health-informed policy gradient in Eq. (7) and Eq. (8) highlights the need for further investigation of policy gradients on systems where actions chosen by the policy do not exactly match the actions executed by an agent or agents [8].

REFERENCES
[1] Edward Balaban, Stephen B. Johnson, and Mykel J. Kochenderfer. 2019. Unifying System Health Management and Automated Decision Making. Journal of Artificial Intelligence Research 65 (Aug. 2019), 487–518. https://doi.org/10.1613/jair.1.11366
[2] Edward Balaban, Sriram Narasimhan, Matthew Daigle, Indranil Roychoudhury, Adam Sweet, Christopher Bond, and G. Gorospe. 2013. Development of a mobile robot test platform and methods for validation of prognostics-enabled decision making algorithms. International Journal of Prognostics and Health Management 4, 1 (2013), 87.
[3] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27, 4 (2002), 819–840.
[4] Abdeslam Boularias and Brahim Chaib-draa. 2008. Exact dynamic programming for decentralized POMDPs with lossless policy compression. In International Conference on Automated Planning and Scheduling (ICAPS). AAAI Press, 20–27.
[5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015).
[6] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. 2016. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML). 1329–1338.
[7] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2018. Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence (AAAI).
[8] Yasuhiro Fujita and Shin-ichi Maeda. 2018. Clipped action policy gradient. arXiv preprint arXiv:1802.07564 (2018).
[9] Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. 2017. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Springer, 66–83.
[10] Eric A. Hansen, Daniel S. Bernstein, and Shlomo Zilberstein. 2004. Dynamic programming for partially observable stochastic games. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 4. 709–715.
[11] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. 2017. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183 (2017).
[12] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33, 6 (2019), 750–797.
[13] Maximilian Hüttenrauch, Adrian Šošić, and Gerhard Neumann. 2018. Local communication protocols for learning complex swarm behaviors with deep reinforcement learning. In International Conference on Swarm Intelligence. Springer, 71–83.
[14] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[15] Mykel J. Kochenderfer. 2015. Decision Making Under Uncertainty: Theory and Application. MIT Press.
[16] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. RLlib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning. 3053–3062.
[17] Ryan Lowe. 2018. Multi-agent particle environment. https://github.com/openai/multiagent-particle-envs
[18] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NIPS). 6379–6390.
[19] Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. 2019. MAVEN: Multi-agent variational exploration. In Advances in Neural Information Processing Systems. 7613–7624.
[20] Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. 2018. Credit assignment for collective multiagent RL with global rewards. In Advances in Neural Information Processing Systems (NIPS). 8102–8113.
[21] Frans A. Oliehoek, Julian F. P. Kooij, and Nikos Vlassis. 2008. The cross-entropy method for policy search in decentralized POMDPs. Informatica 32, 4 (2008), 341–357.
[22] Frans A. Oliehoek, Shimon Whiteson, and Matthijs T. J. Spaan. 2013. Approximate solutions for factored Dec-POMDPs with many agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS). 563–570.
[23] Shayegan Omidshafiei, Ali-akbar Agha-mohammadi, Christopher Amato, Shih-Yuan Liu, Jonathan P. How, and John L. Vian. 2016. Health-aware multi-UAV planning using decentralized partially observable semi-Markov decision processes. In AIAA Infotech@Aerospace. 1407.
[24] OpenAI. 2019. OpenAI Five. https://openai.com/five/#how-openai-five-works
[25] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485 (2018).
[26] Abhinav Saxena, Jose Celaya, Edward Balaban, Kai Goebel, Bhaskar Saha, Sankalita Saha, and Mark Schwabacher. 2008. Metrics for evaluating performance of prognostic techniques. In IEEE International Conference on Prognostics and Health Management. 1–17.
[27] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In International Conference on Machine Learning (ICML). 1889–1897.
[28] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2016. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations.
[29] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint (2017).
[30] Matthijs T. J. Spaan, Frans A. Oliehoek, and Christopher Amato. 2011. Scaling up optimal heuristic search in Dec-POMDPs via incremental expansion. In International Joint Conference on Artificial Intelligence (IJCAI).
[31] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press, Chapter 13, 321–338.
[32] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems. 1057–1063.
[33] Daniel Szer and François Charpillet. 2005. An optimal best-first search algorithm for solving infinite horizon Dec-POMDPs. In European Conference on Machine Learning (ECML). Springer, 389–399.
[34] Justin K. Terry, Benjamin Black, Ananth Hari, Luis Santos, Clemens Dieffendahl, Niall L. Williams, Yashas Lokesh, Caroline Horsch, and Praveen Ravi. 2020. PettingZoo: Gym for Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2009.14471 (2020).
[35] Justin K. Terry, Nathaniel Grammel, Ananth Hari, Luis Santos, Benjamin Black, and Dinesh Manocha. 2020. Parameter Sharing is Surprisingly Useful for Multi-Agent Deep Reinforcement Learning. arXiv preprint arXiv:2005.13625 (2020).
[36] Kagan Tumer, Adrian K. Agogino, and David H. Wolpert. 2002. Learning sequences of actions in collectives of autonomous agents. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS). 378–385.
[37] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 7782 (2019), 350–354.
[38] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
[39] David H. Wolpert and Kagan Tumer. 2002. Optimal payoff functions for members of collectives. In Modeling Complexity in Economic and Social Systems. World Scientific, 355–369.

A EXTENDED PROOFS FOR MULTI-AGENT POLICY GRADIENTS
Here we elaborate on the equations presented in Section 4.1. Using Equation 1, we can derive a multi-agent policy gradient, without the health-informed baseline, from the single-agent policy gradient. We then use this result to provide further detail on the intermediate steps in Lemma 1.

Lemma 3. For $n$ independent agents per Eq. (1),

$$\nabla J(\boldsymbol{\theta}) \propto g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{i=1}^{n} G_t\, \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid \tau_{i,t}, \theta_i) \right].$$
Proof. From Sutton and Barto [31] and Eq. (1) we have

$$\begin{aligned}
g &= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \nabla_{\boldsymbol{\theta}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log \prod_{i=1}^{n} \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \sum_{i=1}^{n} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \sum_{i=1}^{n} \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i),
\end{aligned}$$

which gives the intermediate result

$$g = \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \sum_{i=1}^{n} q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i). \quad (15)$$

Continuing the derivation by replacing sums over variables with expectations of samples drawn following the policy $\boldsymbol{\pi}$ [31],

$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}_t, \boldsymbol{\theta}) \sum_{i=1}^{n} q_{\boldsymbol{\pi}}(\mathbf{s}_t, \mathbf{u})\, \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_{i,t}, \theta_i) \right] = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{i=1}^{n} q_{\boldsymbol{\pi}}(\mathbf{s}_t, \mathbf{u}_t)\, \nabla_{\theta_i} \log \pi_i(a_{i,t} \mid \tau_{i,t}, \theta_i) \right],$$

finally arriving at

$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_{i=1}^{n} G_t\, \nabla_{\theta_i} \log \pi_i(a_{i,t} \mid \tau_{i,t}, \theta_i) \right]. \quad (16) \qquad \square$$

From Eq. (16) we can then arrive at Eq. (7) by substituting $\Psi_{i,t}$ for $G_t$ in order to affect the bias, variance, and credit assignment in the policy gradient.

To investigate the convergence of the health-informed policy gradient, let us apply Equation 8 in place of $q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})$ in the final term of Equation 15. We have

$$\begin{aligned}
g_\Psi &= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \sum_{i=1}^{n} \left( h_i\, q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u}) - b\left(\mathbf{s}^{\neg i}\right) \right) \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \left( \sum_{i=1}^{n} h_i\, q_{\boldsymbol{\pi}}(\mathbf{s}, \mathbf{u})\, \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) - \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \right) \\
&= g_h - g_b.
\end{aligned}$$

Now we show that $g_b = 0$. Note that this is equivalent to the proof given in Lemma 1, but with more intermediate steps included to help the reader follow:

$$\begin{aligned}
g_b &= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{\mathbf{u}} \sum_{i=1}^{n} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta})\, b\left(\mathbf{s}^{\neg i}\right) \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta})\, b\left(\mathbf{s}^{\neg i}\right) \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid \mathbf{s}, \boldsymbol{\theta})\, \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \pi_i(a_i \mid \tau_i, \theta_i)\, \nabla_{\theta_i} \log \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \nabla_{\theta_i} \pi_i(a_i \mid \tau_i, \theta_i)
\end{aligned}$$
$$\begin{aligned}
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \sum_{a_i} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \nabla_{\theta_i} \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j) \sum_{a_i} \nabla_{\theta_i} \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \nabla_{\boldsymbol{\theta}} \sum_{a_i} \pi_i(a_i \mid \tau_i, \theta_i) \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j)\, \nabla_{\boldsymbol{\theta}}\, 1 \\
&= \sum_{\mathbf{s}} \mu(\mathbf{s}) \sum_{i=1}^{n} b\left(\mathbf{s}^{\neg i}\right) \sum_{\mathbf{u}^{\neg i}} \prod_{j=1, j \neq i}^{n} \pi_j(a_j \mid \tau_j, \theta_j) \cdot 0 = 0.
\end{aligned}$$

We should expect this result since the baseline term $b\left(\mathbf{s}^{\neg i}\right)$ is independent of actions, using the same logic provided by Sutton and Barto [31]; however, it requires more manipulation to prove it for the multi-agent case.

B NON-HEALTH-BASED ENVIRONMENTS
The environments described in Section 5 were designed to incorporate the concept of system health. It could be argued that these types of environments are outside the scope of what MADDPG was originally designed for, and therefore not a fair point of comparison. To address this concern, we ran our three variants of multi-agent PPO on the cooperative navigation environment that appears in the original MADDPG paper [18].

Figure 3: Learning curves for the cooperative navigation problem with 3 agents and no hazardous landmarks. Each of the four learning curves is an average over four independent training experiments, with the shaded region representing the minimum and maximum bounds of the four training experiments.

Figure 3 shows MADDPG successfully learning an effective policy over a 50,000-episode training experiment. This learning curve is what we expect to see given that this environment was originally developed in conjunction with MADDPG. It is shown, however, that MADDPG underperforms all of the MAPPO implementations except the local-critic MAPPO. The central-critic, non-crediting variant of MAPPO produces the best performance, outperforming the minimum-health baseline crediting variant. This is in contrast to the results for the hazardous communication scenario discussed in Section 5. However, this is to be expected due to the fact that the cooperative navigation environment used in Fig. 3 does not encapsulate the concepts of health or risk. Therefore, no relevant data is generated that trains the value function network to estimate the value of the minimum-health baseline in Eq. (8). We see that this causes the minimum-health crediting technique to display high variance between training experiments. The blue shaded region indicates that the worst-performing training run of minimum-health MAPPO underperforms all other algorithms, while the best-performing training run outperforms all other algorithms.
