A Review of Cooperative Multi-Agent Deep Reinforcement Learning



Afshin Oroojlooy and Davood Hajinezhad
{afshin.oroojlooy, davood.hajinezhad}@sas.com
SAS Institute Inc., Cary, NC, USA

Abstract

Deep Reinforcement Learning has made significant progress in multi-agent systems in recent years. In this review article, we focus on presenting recent approaches to Multi-Agent Reinforcement Learning (MARL) algorithms. In particular, we cover five common approaches to modeling and solving cooperative multi-agent reinforcement learning problems: (I) independent learners, (II) fully observable critic, (III) value function factorization, (IV) consensus, and (V) learn to communicate. First, we elaborate on each of these methods, their possible challenges, and how these challenges were mitigated in the relevant papers. Where applicable, we further draw connections among the papers in each category. Next, we cover some emerging research areas in MARL along with the relevant recent papers. Due to the recent success of MARL in real-world applications, we devote a section to reviewing these applications and the corresponding articles. A list of available environments for MARL research is also provided in this survey. Finally, the paper concludes with proposals for possible research directions.

Keywords: Reinforcement Learning, Multi-agent systems, Cooperative.

1 Introduction

Multi-Agent Reinforcement Learning (MARL) algorithms deal with systems consisting of several agents (robots, machines, cars, etc.) that interact within a common environment. Each agent makes a decision at each time-step and works along with the other agent(s) to achieve an individual predetermined goal. The goal of MARL algorithms is to learn a policy for each agent such that all agents together achieve the goal of the system.
In particular, the agents are learnable units that aim to learn an optimal policy on the fly to maximize the long-term cumulative discounted reward through interaction with the environment. Due to the complexity of the environments or the combinatorial nature of the problem, training the agents is typically a challenging task, and several problems that MARL deals with are categorized as NP-hard; manufacturing scheduling (Gabel and Riedmiller 2007, Dittrich and Fohlmeister 2020), the vehicle routing problem (Silva et al. 2019, Zhang et al. 2020b), and some multi-agent games (Bard et al. 2020) are only a few examples to mention. Motivated by the recent successes of deep reinforcement learning (RL), such as super-human level control on Atari games (Mnih et al. 2015), mastering the game of Go (Silver et al. 2016) and chess (Silver et al. 2017), robotics (Kober et al. 2013), health-care planning (Liu et al. 2017), power grids (Glavic et al. 2017), routing (Nazari et al. 2018), and inventory optimization (Oroojlooyjadid et al.), on the one hand, and the importance of multi-agent systems (Wang et al. 2016b, Leibo et al. 2017) on the other, several lines of research have focused on deep MARL. One naive approach to solving these problems is to convert the problem into a single-agent problem and make the decisions for all agents using a centralized controller. However, in this approach the number of actions typically increases exponentially, which makes the problem intractable. Besides, each agent needs to send its local information to the central controller, and with an increasing number of agents this approach becomes very expensive or impossible. In addition to the communication cost, this approach depends on the availability of the central unit and is vulnerable to any incident that results in the loss of the network.
Moreover, in multi-agent problems each agent usually accesses only some local information, and due to privacy issues the agents may not be allowed to share their information with each other.

Several properties of the system are important in modeling a multi-agent system: (i) centralized or decentralized control, (ii) fully or partially observable environment, and (iii) cooperative or competitive environment. With a centralized controller, a central unit takes the decision for each agent at each time-step. On the other hand, in a decentralized system each agent takes a decision for itself. Also, the agents might cooperate to achieve a common goal, e.g., a group of robots that want to identify a source, or they might compete with each other to maximize their own reward, e.g., the players on different teams of a game. In each of these cases, an agent might be able to access the whole information and the sensory observations (if any) of the other agents, or, on the other hand, each agent might be able to observe only its local information. In this paper, we focus on decentralized problems with a cooperative goal, and most of the relevant papers with either full or partial observability are reviewed. Note that Weiß (1995), Matignon et al. (2012), Buşoniu et al. (2010), and Bu et al. (2008) provide reviews on cooperative games and general MARL algorithms published until 2012. Also, Da Silva and Costa (2019) provide a survey on the utilization of transfer learning in MARL. Zhang et al. (2019b) provide a comprehensive overview of the theoretical results, convergence, and complexity analysis of MARL algorithms on Markov/stochastic games and extensive-form games in competitive, cooperative, and mixed environments. In the cooperative setting, they mostly focus on the theory of consensus and policy evaluation.
In this paper, we do not limit ourselves to a given branch of cooperative MARL such as consensus; rather, we try to cover most of the recent works on cooperative deep MARL. Nguyen et al. (2020) provide a review of deep MARL from the following perspectives: non-stationarity, partial observability, continuous state and action spaces, training schemes, and transfer learning. We provide a comprehensive overview of current research directions in cooperative MARL under six categories, and we have tried our best to unify all papers through a single notation. Since the problems that MARL algorithms deal with usually include large state/action spaces, and the classical tabular RL algorithms are not efficient for solving them, we mostly focus on approximate cooperative MARL algorithms. The rest of the paper is organized as follows: in Section 2, we discuss the taxonomy and organization of the MARL algorithms we reviewed. In Section 3.1 we briefly explain the single-agent RL problem and some of its components. Then the multi-agent formulation is presented, and some of the main challenges of multi-agent environments from the RL viewpoint are described in Section 3.2. Section 4 explains the independent Q-learner type of algorithms, Section 5 reviews the papers with a fully observable critic model, Section 6 includes the value decomposition papers, Section 7 explains the consensus approach, Section 8 reviews the learn-to-communicate approach, Section 9 explains some of the emerging research directions, Section 10 provides some applications of multi-agent problems and MARL algorithms in the real world, Section 11 very briefly mentions some of the available multi-agent environments, and finally Section 13 concludes the paper.

2 Taxonomy

In this section, we provide a high-level explanation of the taxonomy and the angle from which we looked at MARL.
A simple approach to extending single-agent RL algorithms to multi-agent algorithms is to consider each agent as an independent learner. In this setting, the other agents' actions are treated as part of the environment. This idea was formalized for the first time in Tan (1993), where the Q-learning algorithm was extended to this problem; the result is called independent Q-learning (IQL). The biggest challenge for IQL is non-stationarity, as the other agents' actions toward their local interests impact the environment transitions.

To address the non-stationarity issue, one strategy is to assume that all critics observe the same state (global state) and the actions of all agents, which we call the fully observable critic model. In this setting, the critic model learns the true state-value and, when paired with an actor, can be used toward finding the optimal policy. When the reward is shared among all agents, only one critic model is required; however, in the case of private local rewards, each agent needs to train a local critic model for itself.

Consider multi-agent problem settings where the agents aim to maximize a single joint reward, or assume it is possible to reduce the multi-agent problem to a single-agent problem. A usual RL algorithm may fail to find the globally optimal solution even in this simplified setting. The main reason is that in such settings the agents do not know their true share of the reward for their actions, and as a result some agents may get lazy over time (Sunehag et al. 2018). In addition, exploration by agents with poor policies degrades the team reward and thus restrains the agents with good policies from proceeding toward the optimal policy. One idea to remedy this issue is to figure out the share of each individual agent in the global reward.
This solution is formalized as Value Function Factorization, where a decomposition function is learned from the global reward.

Another drawback of the fully observable critic paradigm is the communication cost. In particular, with an increasing number of learning agents, it might become prohibitive to collect all state/action information in a critic due to communication bandwidth and memory limitations. The same issue occurs for the actor when the observations and actions are being shared. Therefore, the question is how to address this communication issue and change the topology such that the local agents can cooperate and communicate in learning the optimal policy. The key idea is to put the learning agents on a sparsely connected network, where each agent can communicate with a small subset of the agents. Then each agent seeks an optimal solution under the constraint that this solution is in consensus with its neighbors. Through these communications, the whole network eventually reaches a unanimous policy, which results in the optimal policy.

For the consensus algorithms and the fully observable critic model, it is assumed that the agents can send their observations, actions, or rewards to the other agents, and the hope is that they can learn the optimal policy by having that information from the other agents. However, one does not know the true information that is required for an agent to learn the optimal policy. In other words, an agent might be able to learn the optimal policy by sending and receiving a simple message instead of the whole observation, action, and reward information. So, another line of research, called Learn to Communicate, allows the agents to learn what to send, when to send it, and to which agents. In particular, besides the action applied to the environment, the agents learn another action, called the communication action.
Despite the fact that the above taxonomy covers a big portion of MARL, there are still some algorithms that either do not fit in any of these categories or lie at the intersection of a few of them. In this review, we discuss some of these algorithms too. In Table 1, a summary of the different categories is presented. Notice that in the references column we provide only a few representative references; more papers will be discussed in the following sections.

Independent Learners: Each learning agent is an independent learner, without considering its influence on the environment of the other agents. References: Tan (1993), Fuji et al. (2018), Tampuu et al. (2017), Foerster et al. (2017).

Fully Observable Critic: All critic models observe the same state (global state) in order to address the non-stationarity issue. References: Lowe et al. (2017), Ryu et al. (2018), Mao et al. (2019).

Value Function Factorization: Distinguish the share of each individual agent in the global reward to avoid having lazy agents. References: Sunehag et al. (2018), Rashid et al. (2018), Son et al. (2019).

Consensus: To avoid communication overhead, agents communicate through a sparse communication network and try to reach consensus. References: Macua et al. (2018), Zhang et al. (2018c), Cassano et al. (2021).

Learn to Communicate: To improve communication efficiency, the agents are allowed to learn what to send, when to send it, and to which agents. References: Foerster et al. (2016), Jorge et al. (2016), Mordatch and Abbeel (2018b).

Table 1: A summary of the MARL taxonomy in this paper.

3 Background, Single-Agent RL Formulation, and Multi-Agent RL Notation

In this section, we first go over some background on reinforcement learning and the common approaches to solving it for the single-agent problem in Section 3.1.
Then, in Section 3.2 we introduce the notation and definition of the multi-agent sequential decision-making problem and the challenges that MARL algorithms need to address.

3.1 Single Agent RL

RL considers a sequential decision-making problem in which an agent interacts with an environment. The agent at time period t observes state s_t ∈ S, where S is the state space, takes action a_t ∈ A(s_t), where A(s_t) is the valid action space for state s_t, executes it in the environment to receive reward r(s_t, a_t, s_{t+1}) ∈ R, and then transfers to the new state s_{t+1} ∈ S. The process runs for a stochastic T time-steps, at which point an episode ends. The Markov Decision Process (MDP) provides a framework to characterize and study this problem when the agent has full observability of the state. The goal of the agent in an MDP is to determine a policy π : S → A, a mapping from the state space S to the action space A,¹ that maximizes the long-term cumulative discounted reward:

$$J = \mathbb{E}_{\pi, s_0}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, s_{t+1}) \,\middle|\, a_t \sim \pi(\cdot \mid s_t)\right], \qquad (1)$$

where γ ∈ [0, 1] is the discounting factor. Accordingly, the value function starting from state s and following policy π, denoted by V^π(s), is given by

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, s_{t+1}) \,\middle|\, a_t \sim \pi(\cdot \mid s_t),\ s_0 = s\right], \qquad (2)$$

and, given action a, the Q-value is defined as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, s_{t+1}) \,\middle|\, a_t \sim \pi(\cdot \mid s_t),\ s_0 = s,\ a_0 = a\right]. \qquad (3)$$

Given a known state transition probability distribution p(s_{t+1} | s_t, a_t) and reward matrix r(s_t, a_t), Bellman (Bellman 1957) showed that the following equation holds for every state s_t at any time-step t, including for the optimal values:

$$V^{\pi}(s_t) = \sum_{a \in \mathcal{A}(s_t)} \pi(a \mid s_t) \sum_{s' \in \mathcal{S}} p(s' \mid s_t, a)\left[r(s_t, a) + \gamma V^{\pi}(s')\right], \qquad (4)$$

where s' denotes the successor state of s_t and will be used interchangeably with s_{t+1} throughout the paper. Through maximizing over the actions, the optimal state-value and the optimal policy can be obtained:

$$V^{\pi^*}(s_t) = \max_{a} \sum_{s'} p(s' \mid s_t, a)\left[r(s_t, a) + \gamma V^{\pi^*}(s')\right]. \qquad (5)$$

Similarly, the optimal Q-value for each state-action pair can be obtained by:

$$Q^{\pi^*}(s_t, a_t) = \sum_{s'} p(s' \mid s_t, a_t)\left[r(s_t, a_t) + \gamma \max_{a'} Q^{\pi^*}(s', a')\right]. \qquad (6)$$

One can obtain an optimal policy π* by directly learning Q^{π*}(s_t, a_t); the relevant methods are called value-based methods. However, in the real world, knowledge of the environment, i.e., p(s' | s_t, a_t), is usually not available, and one cannot obtain the optimal policy using (5) or (6). To address this issue, learning the state-value, or the Q-value, through sampling has been a common practice. This approximation requires only samples of states, actions, and rewards obtained from interaction with the environment. In the earlier approaches, the value of each state/state-action pair was stored in a table and updated through an iterative approach. Value iteration and policy iteration are two famous algorithms in this category that can attain the optimal policy. However, these approaches are not practical for tasks with enormous state/action spaces due to the curse of dimensionality. This issue can be mitigated through function approximation, in which the parameters of a function are learned by utilizing supervised learning approaches. The function approximator with parameters θ results in policy π_θ(a | s); it can be a simple linear regression model or a deep neural network. Given the function approximator, the goal of an RL algorithm can be rewritten as maximizing the utility function

$$J(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s),\, s \sim \rho^{\pi_\theta}}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t; \theta)\right], \qquad (7)$$

where the expectation is taken over the actions and the distribution of state occupancy. In a different class of approaches, called policy-based methods, the policy, which determines the probability of choosing an action for a given state, is learned directly. In either of these approaches, the goal is to find parameters θ that maximize the utility function J(θ) through learning with sampling. We give a brief overview of the value-based and policy-based approaches in Sections 3.1.1 and 3.1.2, respectively. Note that there is a large body of literature on single-agent RL algorithms, and we only review those algorithms which are actively being used in the multi-agent literature too.

¹ A policy can be deterministic or stochastic. Action a_t is the direct outcome of a deterministic policy, i.e., a_t = π(s_t). In stochastic policies, the outcome of the policy is the probability of choosing each of the actions, i.e., π(a | s_t) = Pr(a_t = a | s_t), and then an additional method is required to choose an action among them. For example, a greedy method chooses the action with the highest probability. In this paper, when we refer to an action resulting from a policy, we mostly use the notation for the stochastic policy, a ∼ π(· | s).
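As a concrete illustration of the tabular setting, the value-iteration update implied by equation (5) can be sketched on a small MDP. All the numbers below are hypothetical, chosen only to make the example self-contained:

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (hypothetical numbers, not from the paper).
# P[s, a, s'] = p(s' | s, a);  R[s, a] = r(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a sum_s' p(s'|s,a) [r(s,a) + gamma V(s')], eq. (5)."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = r(s, a) + gamma * sum_s' p(s'|s,a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
```

At convergence, `V_star` satisfies the Bellman optimality equation (5) up to the tolerance, and `pi_star` is the greedy policy extracted from the optimal Q-values.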
So, to keep the coherency, we prefer not to explore advanced algorithms like soft actor-critic (SAC) (Haarnoja et al. 2018) and TD3 (Fujimoto et al. 2018), which are not so common in the MARL literature. For more details about other single-agent RL algorithms, see Li (2017) and Sutton and Barto (2018).

For all the above notations and descriptions, we assume full observability of the environment. However, when the agent accesses only some part of the state, the problem is categorized as decision-making with partial observability. In such circumstances, the MDP can no longer be used to model the problem; instead, the partially observable MDP (POMDP) is introduced as the modeling framework. This situation arises in a lot of multi-agent systems and will be discussed throughout the paper.

3.1.1 Value Approximation

In value approximation, the goal is to learn a function to estimate V(s) or Q(s, a). It has been shown that, with a large enough number of observations, a linear function approximation converges to a local optimum (Bertsekas and Tsitsiklis 1996, Sutton et al. 2000). Nevertheless, linear function approximators are not powerful enough to capture all the complexities of a complex environment. To address this issue, there has been a tremendous amount of research on non-linear function approximators, and especially neural networks, in recent years (Mnih et al. 2013, Bhatnagar et al. 2009, Gao et al. 2019). Among the recent works on value-based methods, the deep Q-network (DQN) algorithm (Mnih et al. 2015) attracted a lot of attention due to its human-level performance on the Atari-2600 games (Bellemare et al. 2013), in which it only observes the video of the game. DQN utilizes an experience replay (Lin 1992) and a moving target network to stabilize the training.
The experience replay holds the previous observation tuples (s_t, a_t, r_t, s_{t+1}, d_t), in which d_t determines whether the episode ended with this observation. The approximator is then trained by taking a random mini-batch from the experience replay. Utilizing the experience replay results in sample efficiency and stabilizes the training, since it breaks the temporal correlations among consecutive observations. The DQN algorithm uses a deep neural network to approximate the Q-value for each possible action a ∈ A. The input of the network is a function of the state s_t, e.g., the concatenation/average of the last four observed states. The original paper used a combination of a convolutional neural network (CNN) and fully connected (FC) layers as the neural network approximator, although it can be any linear or non-linear function approximator, like any combination of FC, CNN, or recurrent neural networks (RNN). The original DQN algorithm uses a function of state s_t as the input to a CNN, whose output is the input to an FC neural network. This neural network with weights θ is then trained by taking a mini-batch of size m from the experience replay to minimize the following loss function:

$$L(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big(y_i - Q(s_i, a_i; \theta)\big)^2, \qquad (8)$$

$$y_i = \begin{cases} r_i, & d_i = \text{True}, \\ r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-), & d_i = \text{False}, \end{cases} \qquad (9)$$

where θ⁻ denotes the weights of the target network, which is updated to θ every C iterations. Later, new DQN-based approaches were proposed for solving RL problems. For example, inspired by Hasselt (2010), which proposed double Q-learning, the Double-DQN algorithm was proposed in Van Hasselt et al. (2016) to alleviate the over-estimation issue of Q-values. Similarly, Dueling double DQN (Wang et al. 2016c) proposed learning a network with two heads to obtain the advantage value A(s, a) = Q(s, a) − V(s) and V(s), and to use them to get the Q-values.
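The target and loss in equations (8)–(9) can be sketched with numpy on a mini-batch, with the network outputs replaced by plain arrays; all values are illustrative, not from the paper:

```python
import numpy as np

def dqn_targets(r, q_next_target, done, gamma=0.99):
    """y_i = r_i if the episode ended, else r_i + gamma * max_a' Q(s'_i, a'; theta-), eq. (9)."""
    return np.where(done, r, r + gamma * q_next_target.max(axis=1))

def dqn_loss(q_online, actions, y):
    """Mean squared error between y_i and Q(s_i, a_i; theta), eq. (8)."""
    q_taken = q_online[np.arange(len(actions)), actions]
    return np.mean((y - q_taken) ** 2)

# Mini-batch of 3 transitions (hypothetical values).
r = np.array([1.0, 0.0, 2.0])
done = np.array([False, False, True])
q_next = np.array([[0.5, 1.5], [2.0, 0.0], [3.0, 3.0]])  # target-network outputs
y = dqn_targets(r, q_next, done)   # -> [1 + .99*1.5, 0 + .99*2.0, 2.0]

q_online = np.array([[2.0, 0.0], [1.0, 1.0], [0.5, 2.5]])  # online-network outputs
actions = np.array([0, 1, 1])
loss = dqn_loss(q_online, actions, y)
```

Note how the terminal flag cuts off the bootstrap term, exactly as in the case split of equation (9).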
In addition, another extension of the DQN algorithm was proposed by combining recurrent neural networks with DQN. DRQN (Hausknecht and Stone 2015) is a DQN algorithm which uses a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) layer instead of a fully connected network and is applied to a partially observable environment. In all these variants, usually the ε-greedy algorithm is used to ensure exploration. That is, at each time-step, with probability ε the action is chosen randomly, and otherwise it is selected greedily by taking an argmax over the Q-values for the state. Typically, the value of ε is annealed during the training. Choosing the hyper-parameters of the ε-greedy algorithm, the target update frequency, and the experience replay can widely affect the speed and quality of the training. For example, it has been shown that a large experience replay buffer can negatively affect performance. See Zhang and Sutton (2017), Liu and Zou (2018), and Fedus et al. (2020) for more details.

In the final policy, which is used for scoring, usually ε is set to zero, which results in a deterministic policy. This mostly works well in practice, although it might not be applicable to stochastic policies. To address this issue, the softmax operator and the Boltzmann softmax operator are added to DQN (Lipton et al. 2016, Pan et al. 2020) to get the probability of choosing each action,

$$\text{Boltzmann}(Q(s_t, a)) = \frac{e^{\beta Q(s_t, a)}}{\sum_{a' \in \mathcal{A}(s_t)} e^{\beta Q(s_t, a')}}, \quad \forall a \in \mathcal{A}(s_t),$$

in which β is the temperature parameter controlling the rate of stochasticity of the actions. To use the Boltzmann softmax operator, one also needs to find a reasonable value for the temperature parameter, and as a result it cannot be used to find the general optimal stochastic policy. For other extensions of the value-based algorithms and other exploration algorithms, see Li (2017).
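The two action-selection rules above can be sketched in a few lines; the Q-values and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action, else argmax over Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, beta):
    """Boltzmann(Q)(a) = exp(beta*Q(a)) / sum_a' exp(beta*Q(a'))."""
    z = np.exp(beta * (q_values - q_values.max()))  # shift by max for numerical stability
    return z / z.sum()

q = np.array([1.0, 2.0, 0.5])
probs = boltzmann(q, beta=1.0)   # highest Q-value gets the highest probability
```

As β grows, `boltzmann` concentrates its mass on the argmax action, which is the sense in which the temperature controls the rate of stochasticity.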
3.1.2 Policy Approximation

In the value-based methods, the key idea is learning the optimal value function or the Q-function, from which a greedy policy can be obtained. In another direction, one can parametrize the policy, define a utility function, and try to optimize this function over the policy parameters through a supervised-learning-like process. This class of RL algorithms is called policy-based methods, which provide a probability distribution over actions. For example, in the policy gradient method, a stochastic policy with parameters θ ∈ R^d is learned. First, we define h(s_t, a_t; θ) = θᵀφ(s_t, a_t), where φ(s_t, a_t) ∈ R^d is called the feature vector of the state-action pair (s_t, a_t). Then the stochastic policy can be obtained by

$$\pi(a_t \mid s_t; \theta) = \frac{e^{h(s_t, a_t; \theta)}}{\sum_{b} e^{h(s_t, b; \theta)}},$$

which is the softmax function. The goal is to directly learn the parameters θ using a gradient-based algorithm. Let us define J(θ) to measure the expected value of policy π_θ over the trajectory τ = s_0, a_0, r_0, ..., s_T, a_T, r_T:

$$J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t)\right].$$

Then the policy gradient theorem provides an analytical expression for the gradient of J(θ) as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\nabla_\theta \log p(\tau; \theta)\, G(\tau)\right] \qquad (10)$$

$$= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s),\, s \sim \rho^{\pi_\theta}}\left[\nabla_\theta \log \pi(a \mid s; \theta)\, Q^{\pi_\theta}(s, a)\right], \qquad (11)$$

in which G(τ) is the return of the trajectory. The policy-based methods then update the parameters θ as below:

$$\theta_{t+1} = \theta_t + \alpha\, \widehat{\nabla_\theta J(\theta)}, \qquad (12)$$

where $\widehat{\nabla_\theta J(\theta)}$ is an approximation of the true gradient and α is the learning rate. Depending on how ∇_θ log p(τ; θ), log π(a | s; θ), or G(τ) are estimated, there are several variants of policy gradient algorithms. In the following, we explore some of these variants, which are mostly used in MARL algorithms.
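The linear-softmax policy and its score function ∇_θ log π(a|s; θ), the core quantity in equations (10)–(12), can be sketched as follows. For a linear softmax, the score works out to φ(s, a) minus the policy-weighted average feature vector; the feature matrix below is hypothetical:

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi(a|s; theta) = exp(theta . phi(s,a)) / sum_b exp(theta . phi(s,b)).
    phi has one row of features per action."""
    h = phi @ theta
    z = np.exp(h - h.max())  # shift by max for numerical stability
    return z / z.sum()

def score(theta, phi, a):
    """grad_theta log pi(a|s; theta) = phi(s,a) - sum_b pi(b|s) phi(s,b) for linear softmax."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

# Hypothetical state with 2 actions and 3 features per state-action pair.
phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
theta = np.zeros(3)
g = score(theta, phi, a=0)   # -> [0.5, -0.5, 0.0] for the uniform initial policy
```

Scaling `g` by an estimate of Q^{π_θ}(s, a) or G(τ) and adding it to θ is exactly the update (12).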
The REINFORCE algorithm is one of the first policy-gradient algorithms (Sutton et al. 2000). In particular, REINFORCE applies the Monte Carlo method and uses the actual return G_t := Σ_{t'=t}^{T} γ^{t'} r(s_{t'}, a_{t'}) as an approximation of G(τ) in equation (10), and rewrites the gradient ∇_θ log p(τ; θ) as Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t). This provides an unbiased estimate of the gradient; however, it has a high variance, which makes the training quite hard in practice. To reduce the variance, it has been shown that subtracting a baseline b from G(τ) is very helpful and is widely used in practice. If the baseline is a function of state s_t, the gradient estimator remains unbiased, which makes V(s) an appealing candidate for the baseline function. Intuitively, with a positive G_t − V(s_t), we want to move along the gradient, since it results in a cumulative reward higher than the average cumulative reward for that state, and vice versa. There are several other extensions of the REINFORCE algorithm, each trying to reduce the variance of the gradient estimate. Among them, REINFORCE with reward-to-go and baseline usually outperforms REINFORCE with baseline alone. For more details about other extensions, see Sutton and Barto (2018). REINFORCE uses the actual return from a random trajectory, which might introduce a high variance into the training. In addition, one needs to wait until the end of the episode to obtain the actual cumulative discounted reward. The actor-critic (AC) algorithm extends REINFORCE by eliminating this constraint and reducing the variance of the gradient estimate. In AC, instead of waiting until the end of the episode, a critic model is used to approximate the value of state s_t by Q(s_t, a_t) = r(s_t, a_t) + γ V(s_{t+1}).
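The reward-to-go return and a simple baseline subtraction can be sketched as below (the backward recursion computes the discount relative to each time-step t; the rewards and the constant baseline are illustrative):

```python
import numpy as np

def rewards_to_go(rewards, gamma):
    """G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}, computed backwards in O(T)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [1.0, 0.0, 2.0]
G = rewards_to_go(rewards, gamma=0.5)   # -> [1 + 0 + 0.25*2, 0 + 0.5*2, 2] = [1.5, 1.0, 2.0]

# The simplest baseline: subtract the mean return (a learned V(s_t) would replace it).
advantages = G - G.mean()
```

Subtracting the baseline shifts the returns without changing their ordering, which is why the gradient estimator stays unbiased while its variance drops.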
As a natural choice for the baseline, picking V(s_t) results in the advantage function A(s_t, a_t) = r(s_t, a_t) + γ V(s_{t+1}) − V(s_t). The critic model is trained by calculating the TD-error δ_t = r(s_t, a_t) + γ V_w(s_{t+1}) − V_w(s_t) and updating w by w ← w + α_w δ_t ∇_w V_w(s_t), in which α_w is the critic's learning rate. Therefore, there is no need to wait until the end of the episode, and after each time-step one can run a training step. With AC, it is also straightforward to train an agent for non-episodic environments.

Following this technique, several AC-based methods were proposed. Asynchronous advantage actor-critic (A3C) contains a master node connected to a few worker nodes (Mnih et al. 2016). This algorithm runs several instances of the actor-critic model and asynchronously gathers the gradients to update the weights of the master node. Afterward, the master node broadcasts the new weights to the worker nodes, and in this way all nodes are updated asynchronously. The synchronous advantage actor-critic (A2C) algorithm uses the same framework but updates the weights synchronously. None of REINFORCE, AC, A2C, and A3C guarantees a steady improvement of the objective function. The trust region policy optimization (TRPO) algorithm (Schulman et al. 2015) was proposed to address this issue. TRPO tries to obtain new parameters θ′ with the goal of maximizing the difference J(θ′) − J(θ), where θ is the parameter of the current policy.
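One TD(0) step of a linear critic, i.e., the δ_t computation and the update w ← w + α_w δ_t ∇_w V_w(s_t), can be sketched as follows (for V_w(s) = w·φ(s), the gradient ∇_w V_w(s) is just φ(s); the features and reward are illustrative):

```python
import numpy as np

def td_critic_step(w, phi_s, phi_s_next, r, gamma, alpha_w):
    """One TD(0) update of a linear critic V_w(s) = w . phi(s):
    delta = r + gamma*V_w(s') - V_w(s);  w <- w + alpha_w * delta * phi(s)."""
    delta = r + gamma * (w @ phi_s_next) - (w @ phi_s)
    return w + alpha_w * delta * phi_s, delta

# Hypothetical one-hot state features for a 2-state problem.
w = np.zeros(2)
phi_s, phi_s_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w, delta = td_critic_step(w, phi_s, phi_s_next, r=1.0, gamma=0.9, alpha_w=0.1)
```

The same δ_t doubles as the advantage estimate fed to the actor, which is why no end-of-episode return is needed.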
Under the trajectory generated from the new policy π_{θ′}, it can be shown that

$$J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi_\theta}(s_t, a_t)\right] \qquad (13a)$$

$$= \sum_{t=0}^{\infty} \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right] \qquad (13b)$$

$$\approx \sum_{t=0}^{\infty} \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right], \qquad (13c)$$

where the expectation over s_t in (13b) is with respect to p_{θ′}(s_t), though we do not yet have θ′. To address this issue, TRPO approximates (13b) by substituting s_t ∼ p_{θ′}(s_t) with s_t ∼ p_θ(s_t), assuming that π_{θ′} is close to π_θ, resulting in (13c). Under the assumption |π_{θ′}(a_t | s_t) − π_θ(a_t | s_t)| ≤ ε for all s_t, Schulman et al. (2015) show that

$$J(\theta') - J(\theta) \ge \sum_{t=0}^{\infty} \mathbb{E}_{s_t \sim p_\theta(s_t)}\left[\mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, \gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right] - \sum_{t} 2t\varepsilon C, \qquad (14)$$

where C is a function of r_max and T. Thus, if ε is small enough, TRPO guarantees monotonic improvement under the assumption of the closeness of the policies. TRPO uses the Kullback–Leibler divergence D_KL(p_1(x) ‖ p_2(x)) = E_{x∼p_1(x)}[log(p_1(x)/p_2(x))] to measure the amount of change in the policy update. Therefore, it uses the bound |π_{θ′}(a_t | s_t) − π_θ(a_t | s_t)| ≤ √(0.5 D_KL(π_{θ′}(· | s_t) ‖ π_θ(· | s_t))) ≤ ε and solves a constrained optimization problem. To solve this optimization problem, TRPO approximates the objective function by the first-order term of the corresponding Taylor series. Similarly, the constraint is approximated by the second-order term of the corresponding Taylor series. This results in a policy gradient update which involves calculating the inverse of the Fisher matrix (F⁻¹) and a specific learning rate to make sure that the bound ε holds.
Neural networks may have millions of parameters, which makes it impossible to directly compute F^{−1}. To address this issue, the conjugate gradient algorithm (Hestenes et al. 1952) is used. In general, despite TRPO's benefits, it is a relatively complicated algorithm, it is expensive to run, and it is not compatible with architectures that include noise (such as dropout) or parameter sharing among different networks (Schulman et al. 2017). To address these issues, Schulman et al. (2017) proposed the proximal policy optimization (PPO) algorithm, which avoids the Fisher matrix and its computational burden. PPO defines r_t(θ) = π_{θ′}(a_t|s_t)/π_θ(a_t|s_t) and obtains the gradient update by

\[ \mathbb{E}_{\tau \sim p_\theta} \left[ \sum_{t=0}^{\infty} \min\left( r_t(\theta) A^{\pi_\theta}(s_t, a_t),\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A^{\pi_\theta}(s_t, a_t) \right) \right], \tag{15} \]

which in practice works as well as TRPO in most cases. Despite their data inefficiency, policy-based algorithms provide better convergence guarantees than value-based algorithms (Yang et al. 2018b, Zhang et al. 2020a, Agarwal et al. 2020). This remains true for policy gradient methods that use neural networks as function approximators (Liu et al. 2019, Wang et al. 2020b). In addition, compared to value-based algorithms, policy-based approaches can be easily applied to continuous control problems. Furthermore, for most problems we do not know the true form of the optimal policy, i.e., deterministic or stochastic. The policy gradient has the ability to learn either a stochastic or a deterministic policy; however, in value-based algorithms one needs to know the form of the policy at the algorithm's design time, which might be unknown.
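The clipped surrogate in (15) can be sketched per sample; the ratios and advantages below are invented for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.minimum(unclipped, clipped))

# With a positive advantage the surrogate stops growing once the ratio
# exceeds 1 + eps, removing the incentive for an overly large policy step;
# with a negative advantage the min keeps the more pessimistic value.
inside = ppo_clip_objective(1.1, 1.0)      # within the clip range
capped = ppo_clip_objective(1.5, 1.0)      # capped at (1 + eps) * A
pessimistic = ppo_clip_objective(0.5, -1.0)
```

This single min/clip replaces TRPO's explicit KL constraint, which is why no Fisher-vector products are needed.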
This results in two benefits of policy-gradient methods over value-based methods (Sutton and Barto 2018): (i) when the optimal policy is a stochastic policy (like in the Tic-tac-toe game), the policy gradient is by nature able to learn it, whereas value-based algorithms have no way of learning an optimal stochastic policy; (ii) if the optimal policy is deterministic, policy gradient algorithms have a chance of converging to a deterministic policy, while with a value-based algorithm one does not know the true form of the optimal policy and therefore cannot choose the optimal exploration parameter (like ε in the ε-greedy method) to use at inference (scoring) time. Regarding the first benefit, note that one may apply a softmax operator over the Q-values to obtain probabilities for choosing each action; but value-based algorithms cannot learn those probabilities by themselves as a stochastic policy. Similarly, one may choose a non-zero ε at scoring time, but the optimal value of such an ε is unknown, so this method does not necessarily result in the optimal policy. Regarding the second benefit, note that this issue is not limited to the ε-greedy algorithm. With a softmax operator added to a value-based algorithm, we get a probability of choosing each action. Even in this setting, the algorithm is designed to learn the true value of each action, and there is no known mapping from true values to the optimal probabilities of choosing each action, so the resulting probabilities are not necessarily 0 and 1. Similarly, the other variants of the softmax operator, like the Boltzmann softmax which uses a temperature parameter, do not help either. Although the temperature parameter can push the policy toward determinism, in practice we do not know whether the optimal solution is deterministic in order to exploit that.
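The point about the Boltzmann softmax can be made concrete: for any positive temperature, every action keeps non-zero probability, so the induced policy is never exactly deterministic; only the (unknowable in advance) limit T → 0 is. A minimal illustration with invented Q-values:

```python
import numpy as np

def boltzmann(q, temperature):
    """Softmax over action values with a temperature parameter."""
    z = (q - q.max()) / temperature   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 1.5, 0.5])         # illustrative action values

p_hot = boltzmann(q, temperature=1.0)    # fairly spread-out distribution
p_cold = boltzmann(q, temperature=0.05)  # sharply peaked, but still stochastic
```

Lowering the temperature concentrates mass on the greedy action without ever reaching probability exactly 1, which is precisely why value-based methods cannot recover an optimal stochastic (or provably deterministic) policy this way.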
3.2 Multi-Agent RL Notations and Formulation

We denote a multi-agent setting with the tuple ⟨N, S, A, R, P, O, γ⟩, in which N is the number of agents, S is the state space, A = {A_1, ..., A_N} is the set of action spaces of all agents, P is the transition probability among the states, R is the reward function, and O = {O_1, ..., O_N} is the set of observation spaces of all agents. Within any type of environment, we use a to denote the vector of actions of all agents, a_{−i} denotes the actions of all agents except agent i, τ_i represents the observation-action history of agent i, and τ is the observation-action history of all agents; T denotes the observation-action space. Then, in a cooperative problem with N agents and full observability of the environment, each agent i at time-step t observes the global state s^t, uses its local stochastic policy π_i to take action a_i^t, and then receives reward r_i^t. If the environment is fully cooperative, at each time step all agents observe a joint reward value r^t, i.e., r_1^t = ... = r_N^t = r^t. If the agents are not able to fully observe the state of the system, each agent only accesses its own local observation o_i^t. Similar to the single-agent case, each agent can try to learn the optimal Q-values or the optimal stochastic policy. However, since the policy of each agent changes as training progresses, the environment becomes non-stationary from the perspective of any individual agent. Basically, P(s′ | s, a_i, π_1, ..., π_N) ≠ P(s′ | s, a_i, π′_1, ..., π′_N) when any π_i ≠ π′_i, so we lose the underlying assumption of the MDP. This means that each agent's experience involves different co-player policies, so the co-player policies cannot be treated as fixed, and any attempt to train a model as if they were results in fluctuations during training.
This makes model training quite challenging. Therefore, the adapted Bellman equation for MARL (Foerster et al. 2017) (assuming full observability) also does not hold for the multi-agent system:

\[ Q_i^*(s, a_i \mid \pi_{-i}) = \sum_{a_{-i}} \pi_{-i}(a_{-i} \mid s) \left[ r(s, a_i, a_{-i}) + \gamma \sum_{s'} P(s' \mid s, a_i, a_{-i}) \max_{a'_i} Q_i^*(s', a'_i) \right], \tag{16} \]

where π_{−i} = Π_{j≠i} π_j(a_j | s). Since π_{−i} changes over time as the policies of the other agents change, in MARL one cannot obtain the optimal Q-values using the classic Bellman equation. On the other hand, the policy of each agent changes during training, which results in a mix of observations from different policies in the experience replay. Thus, one cannot use the experience replay without dealing with the non-stationarity. Without experience replay, the DQN algorithm (Mnih et al. 2015) and its extensions can be hard to train due to sample inefficiency and correlation among the samples. The same issue exists within AC-based algorithms that use a DQN-like algorithm for the critic. Besides, in most MARL problems the agents are not able to observe the full state of the system; such problems are categorized as decentralized POMDPs (Dec-POMDPs). Due to the partial observability and the non-stationarity of the local observations, Dec-POMDPs are even harder problems to solve, and it can be shown that they are in the class of NEXP-complete problems (Bernstein et al. 2002). An equation similar to (16) can be obtained for the partially observable environment too. In multi-agent RL, the noise and variance of the rewards increase, which results in instability of the training. The reason is that the reward of one agent depends on the actions of the other agents, and the reward conditioned on the action of a single agent can exhibit much more noise and variability than a single agent's reward.
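The non-stationarity above can be seen numerically: from one agent's perspective, the distribution over next states for a fixed own action shifts as soon as the co-player's policy changes. A toy two-agent example (all probabilities invented for illustration):

```python
import numpy as np

# Joint transition for a single state: the next state depends on the
# actions of BOTH agents. P[s, a1, a2] -> distribution over 2 next states.
P = np.zeros((1, 2, 2, 2))
P[0, 0, 0] = [0.9, 0.1]
P[0, 0, 1] = [0.2, 0.8]
P[0, 1, 0] = [0.5, 0.5]
P[0, 1, 1] = [0.3, 0.7]

def effective_transition(a1, pi2):
    """P(s' | s=0, a1) as experienced by agent 1, with agent 2 marginalized out."""
    return sum(pi2[a2] * P[0, a1, a2] for a2 in (0, 1))

pi2_old = np.array([1.0, 0.0])   # co-player always plays action 0
pi2_new = np.array([0.0, 1.0])   # co-player switches to action 1

before = effective_transition(0, pi2_old)
after = effective_transition(0, pi2_new)
# Same state, same own action -- different effective dynamics, i.e.,
# P(s'|s, a_1, pi_2) != P(s'|s, a_1, pi'_2), exactly the MDP violation above.
```

From agent 1's point of view the "environment" has changed even though the true joint dynamics P never did.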
Therefore, training a policy gradient algorithm naively would also not be effective in general. Finally, we define the following notation for the Nash equilibrium, which is used in a couple of papers. A joint policy π^* defines a Nash equilibrium if and only if:

\[ v_i^{(\pi_i^*, \pi_{-i}^*)}(s) \ge v_i^{(\pi_i, \pi_{-i}^*)}(s), \quad \forall \pi_i \in \Pi_i,\ \forall s \in S,\ \forall i \in \{1, \ldots, N\}, \tag{17} \]

in which v_i^{(π_i, π_{−i})}(s) is the expected cumulative long-term return of agent i in state s, and Π_i is the set of all possible policies of agent i. In particular, it means that no agent can improve its long-term cumulative discounted reward by unilaterally changing its policy. Further, if for a policy π̂ it holds that v_i^{π̂}(s) ≥ v_i^{π}(s) for all i ∈ {1, ..., N}, all alternative policies π, and all s ∈ S, then policy π̂ is called Pareto-optimal. We introduce notation for the Nash equilibrium only to the extent needed to present a few well-known papers on cooperative MARL. For more details on this topic, see Yang and Wang (2020).

4 Independent Learners

One of the first proposed approaches to solve the multi-agent RL problem is to treat each agent independently, such that it considers the rest of the agents as part of the environment. This idea is formalized in the independent Q-learning (IQL) algorithm (Tan 1993), in which each agent accesses its local observation and all agents together try to maximize a joint reward. Each agent runs a separate Q-learning algorithm (Watkins and Dayan 1992) (or one of its newer extensions, like DQN (Mnih et al. 2015), DRQN (Hausknecht and Stone 2015), etc.). IQL is an appealing algorithm since (i) it does not have the scalability and communication problems that central control methods encounter as the number of agents increases, and (ii) each agent only needs its local history of observations during training and at inference time.
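The IQL idea can be sketched with two tabular Q-learners sharing an environment; the toy coordination "game" below (both agents are rewarded only when their actions match) is invented for illustration and is not from the original paper:

```python
import numpy as np

# Sketch of independent Q-learning (IQL): each agent runs its own tabular
# Q-learner on its local observation, treating the other agent as part of
# the environment. Toy 2-agent coordination task, illustrative only.

n_obs, n_actions, n_agents = 2, 2, 2
Q = [np.zeros((n_obs, n_actions)) for _ in range(n_agents)]
alpha, gamma, eps = 0.2, 0.9, 0.1
rng = np.random.default_rng(1)

def joint_step(obs, actions):
    # joint reward: both agents must pick the same action
    r = 1.0 if actions[0] == actions[1] else 0.0
    next_obs = [int(rng.integers(n_obs)) for _ in range(n_agents)]
    return next_obs, r

obs = [0, 0]
for _ in range(2000):
    actions = []
    for i in range(n_agents):  # epsilon-greedy on each agent's OWN table
        if rng.random() < eps:
            actions.append(int(rng.integers(n_actions)))
        else:
            actions.append(int(Q[i][obs[i]].argmax()))
    next_obs, r = joint_step(obs, actions)
    for i in range(n_agents):  # each agent updates independently
        td = r + gamma * Q[i][next_obs[i]].max() - Q[i][obs[i], actions[i]]
        Q[i][obs[i], actions[i]] += alpha * td
    obs = next_obs
```

Each learner's update never looks at the other agent's action, which is exactly what makes the environment appear non-stationary to it.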
Although it fits partially observable settings very well, it suffers from the non-stationarity of the environment. Tabular IQL usually works well in practice for small-size problems (Matignon et al. 2012, Zawadzki et al. 2014); however, with function approximation, especially deep neural networks (DNNs), it may not work very well. One of the main reasons for this weak performance is the need for experience replay to stabilize training with DNNs (Foerster et al. 2017). In an extension of IQL, Distributed Q-learning (Lauer and Riedmiller 2000) considers a decentralized fully cooperative multi-agent problem in which all agents observe the full state of the system and do not know the actions of the other agents, although at training time it assumes the joint action is available to all agents. The joint action is executed in the environment, which returns the joint reward that each agent receives. This algorithm updates the Q-values only when there is a guaranteed improvement, assuming that low returns are the result of bad exploration by the teammates. In other words, it maximizes over the possible actions of agent i, assuming the other agents selected their locally optimal actions; i.e., for a given joint action a^t = (a_1^t, ..., a_N^t), it updates the Q-values of agent i by:

\[ q_i^{t+1}(s, a) = \begin{cases} q_i^t(s, a) & \text{if } s \ne s^t \text{ or } a \ne a_i^t, \\ \max\left\{ q_i^t(s, a),\ r(s^t, a^t) + \gamma \max_{a' \in \mathcal{A}_i} q_i^t(\delta(s^t, a^t), a') \right\} & \text{otherwise}, \end{cases} \tag{18} \]

in which q_i^t(s, a) = max_{\mathbf{a} = (a_1, ..., a_N): a_i = a} Q(s, \mathbf{a}), and δ(s^t, a^t) is the environment transition function that produces s^{t+1}. Therefore, Distributed Q-learning completely ignores the low rewards, which causes overestimated Q-values. This issue, together with the curse of dimensionality, results in poor performance on high-dimensional problems. Hysteretic Q-learning (Matignon et al.
2007) considers the same problem and tries to obtain a good policy under the assumption that a low return might be the result of stochasticity in the environment, so it does not ignore low returns as Distributed Q-learning does. In particular, when the TD-error is positive, it updates the Q-values with learning rate α; otherwise, it updates the Q-values with learning rate β < α. Thus, the model is also robust against negative learning due to teammate exploration. Bowling and Veloso (2002) also propose using a variable learning rate to improve the performance of tabular IQL. In another extension of IQL, Fuji et al. (2018) propose training one agent at a time while fixing the policies of the other agents in a periodic manner in order to stabilize the environment. Thus, during training, the other agents do not change their policies and the environment is stationary from the viewpoint of the single learning agent. The DQN algorithm (Mnih et al. 2015) utilizes experience replay and a target network and was able to attain super-human-level control on most of the Atari games. Classical IQL uses the tabular version, so a natural idea is to use the DQN algorithm in place of each single Q-learner. Tampuu et al. (2017) implemented this idea in one of the first papers to exploit the neural network as a general powerful approximator in an IQL-like setting. Specifically, this paper analyzes the performance of DQN in a decentralized two-agent game, for both competitive and cooperative settings. The authors assume that each agent observes the full state (the video of the game) and takes an action via its own policy, and the reward values are known to both agents. The paper is mainly built on the Pong game (from the Atari-2600 environment (Bellemare et al. 2013)), in
which, by changing the reward function, competitive and cooperative behaviors are obtained. In the competitive version, the agent that drops the ball loses a reward point and its opponent wins the reward point, so that it is a zero-sum game. In the cooperative setting, once either of the agents drops the ball, both agents lose a reward point. The numerical results show that in both cases the agents are able to learn how to play the game very efficiently; that is, in the cooperative setting they learn to keep the ball for long periods, and in the competitive setting each agent learns to quickly beat its competitor. Experience replay is one of the core elements of the DQN algorithm. It helps to stabilize the training of the neural network and improves the sample efficiency over the history of observations. However, due to the non-stationarity of the environment, using experience replay in a multi-agent environment is problematic. In particular, the policy that generated the data in the experience replay differs from the current policy, so the learned policy of each agent can be misleading. To sidestep this issue, Foerster et al. (2016) disable the experience replay part of the algorithm, while in Leibo et al. (2017) the old transitions are discarded and the experience replay uses only the recent experiences. Even though these approaches help to reduce the non-stationarity of the environment, both limit the sample efficiency. To resolve this problem, Foerster et al. (2017) propose two algorithms to stabilize the experience replay in IQL-type algorithms. They consider a fully cooperative MARL problem with local observations and actions. In the first approach, each transition is augmented with the probability of choosing the joint action. Then, during the loss calculation, the importance sampling correction is computed using the current policy.
Thus, the loss function is changed to:

\[ L(\theta_i) = \sum_{k=1}^{b} \frac{\pi_{-i}^{t_c}(a_{-i} \mid s)}{\pi_{-i}^{t_k}(a_{-i} \mid s)} \left( y_i^{DQN} - Q(s, a_i; \theta_i) \right)^2, \tag{19} \]

in which θ_i denotes the policy parameters of agent i, t_c is the current time-step, and t_k is the time at which the k-th sample was collected. In this way, the effect on the gradients of transitions generated from dissimilar policies is regularized. In the second algorithm, named FingerPrint, they propose augmenting the experience replay with some parts of the policies of the other agents. However, the number of parameters in a DNN is usually large, which makes this intractable in practice. Thus, they propose to augment each instance in the experience replay with the iteration number e and the ε of the ε-greedy algorithm. In the numerical experiments, they share the weights among the agents, while the ID of each agent is also available as an input. They provide the results of the two proposed algorithms, plus their combination, on the StarCraft game (Samvelyan et al. 2019) and compare the results with a classic experience replay and a no-experience-replay algorithm. They conclude that the second algorithm obtains better results than the other algorithms. Omidshafiei et al. (2017) propose another extension of experience replay for MARL. They consider multi-task cooperative games with independent partially observable learners, such that each agent only knows its own action, with a joint reward. An algorithm called HDRQN is proposed, which is based on the DRQN algorithm (Hausknecht and Stone 2015) and Hysteretic Q-learning (Matignon et al. 2007).
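The reweighting in Eq. (19) can be sketched directly; the stored probabilities, targets, and Q estimates below are all invented for illustration:

```python
# Sketch of the importance-weighted replay loss of Eq. (19): each stored
# transition is reweighted by the ratio of the probability of the other
# agents' joint action under the CURRENT policy to its probability under
# the policy at COLLECTION time. All numbers are illustrative.

def weighted_loss(batch):
    """batch items: (p_then, p_now, y, q) where p_then / p_now are the
    probabilities of a_{-i} at collection time / now, y is the TD target,
    and q is the current Q estimate."""
    total = 0.0
    for p_then, p_now, y, q in batch:
        w = p_now / p_then            # importance correction pi^{t_c} / pi^{t_k}
        total += w * (y - q) ** 2
    return total

# Two stored transitions: the second was collected when the observed joint
# action was half as likely as it is now, so its TD error is up-weighted by 2.
loss = weighted_loss([(0.5, 0.5, 1.0, 0.8), (0.25, 0.5, 0.0, 0.1)])
```

Transitions generated by policies similar to the current one keep weight near 1, while very stale transitions are scaled up or down accordingly.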
Also, to alleviate the non-stationarity of MARL, the idea of Concurrent Experience Replay Trajectories (CERTs) is proposed, in which the experience replay gathers the experiences of all agents for each period of an episode, and when sampling a mini-batch it draws the experiences of all agents for the same period together. Since they use an LSTM, the experiences in the experience replay are zero-padded (zeros are appended to the shorter experiences to make all experiences equal in size). Moreover, in the multi-task version of HDRQN, there are different tasks, each with its own transition probability, observation, and reward function. During training, each agent observes the task ID, while it is not accessible at inference time. To evaluate the model, a two-player game is utilized in which the agents are rewarded only when all agents simultaneously capture the moving target. In order to make the game partially observable, a flickering screen is used, such that with 30% chance the screen flickers. The actions of the agents are moving north, south, west, east, or waiting. Additionally, actions are noisy, i.e., with 10% probability an agent acts differently than it intended.

5 Fully Observable Critic

Non-stationarity of the environment is the main issue in multi-agent problems and MARL algorithms. One of the common approaches to address this issue is using a fully observable critic. The fully observable critic involves the observations and actions of all agents, and as a result the environment is stationary even though the policies of the other agents change. In other words, P(s′ | s, a_1, ..., a_N, π_1, ..., π_N) = P(s′ | s, a_1, ..., a_N, π′_1, ..., π′_N) even if π_i ≠ π′_i, since the environment returns the same next-state distribution regardless of changes in the policies of the other agents.
Following this idea, there can be one or N critic models: (i) in a fully cooperative problem, one central critic is trained, and (ii) when each agent observes a local reward, each agent may need to train its own critic model, resulting in N critic models. In either case, once the critic is fully observable, the non-stationarity of the critic is resolved and it can serve as a good guide for the local actors. Using this idea, Lowe et al. (2017) propose a model-free multi-agent reinforcement learning algorithm for the problem in which agent i at time step t of execution accesses its own local observation o_i^t, local action a_i^t, and local reward r_i^t. They consider cooperative, competitive, and mixed competitive-cooperative games, and propose the multi-agent DDPG (MADDPG) algorithm, in which each agent trains a DDPG algorithm such that the actor π_i(o_i; θ_i) with policy weights θ_i observes the local observations, while the critic Q_i^{µ_i} is allowed to access the observations, actions, and target policies of all agents at training time. The critic of each agent concatenates all observation-actions together as its input and, using the local reward, obtains the corresponding Q-value. Each of the N critics is trained by minimizing a DQN-like loss function:

\[ L(\mu_i) = \mathbb{E}_{o^t, a^t, r^t, o^{t+1}} \left[ \left( Q_i(o^t, a_1^t, \ldots, a_N^t; \mu_i) - y \right)^2 \right], \quad y = r_i^t + \gamma\, Q_i\!\left( o^{t+1}, \bar{a}_1^{t+1}, \ldots, \bar{a}_N^{t+1}; \bar{\mu}_i \right) \Big|_{\bar{a}_j^{t+1} = \bar{\pi}_j(o_j^{t+1})}, \]

in which o^t is the observation of all agents, π̄_j is the target policy of agent j, and µ̄_i parameterizes the target critic. As a result, the critic of each agent deals with a stationary environment, and at inference time each agent only needs to access its local information. MADDPG is compared with the decentralized trained versions of DDPG (Lillicrap et al. 2016), DQN (Mnih et al. 2015), REINFORCE (Sutton et al. 2000), and TRPO (Schulman et al.
2015) in a set of grounded communication environments from the particle environment (Haarnoja et al. 2018), e.g., predator-prey, the arrival task, etc. The continuous state-action predator-prey environment from this suite is usually considered a benchmark for MARL algorithms with local observations and cooperative rewards. In the most basic version of predator-prey, two predators are randomly placed in a 5 × 5 grid, along with one prey which is also randomly located in the grid. Each predator observes its direct neighboring cells, i.e., a 3 × 3 window, and the goal is to catch the prey together to receive a reward; in all other situations each predator obtains a negative reward. Several extensions of the MADDPG algorithm have been proposed in the literature, and we review some of them in the rest of this section. Ryu et al. (2018) propose an actor-critic model with local actors and critics for a Dec-POMDP problem, in which each agent observes a local observation o_i, observes its own reward r_i : S × A_1 × · · · × A_N → R, and learns a deterministic policy µ_{θ_i} : O_i → A_i. The goal of agent i is to maximize its own discounted return R_i = Σ_{t=0}^∞ γ^t r_i^t. An extension of MADDPG with a generative cooperative policy network, called MADDPG-GCPN, is proposed, in which there is an extra actor network µ_i^c to generate action samples of the other agents. The critic then uses an experience replay filled with actions sampled from the GCPN rather than from the actors of the other agents, so there is no need to share the target policies of the other agents during training. Further, the algorithm is modified such that the critic can use either the immediate individual or the joint reward during training. They present a new version of the predator-prey game in which each agent receives an individual reward, plus a shared one if they catch the prey.
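The centralized critic update shared by MADDPG and its variants can be sketched with plain numpy; the "networks" below are linear stand-ins and every shape, value, and learning rate is an invented assumption, not the authors' implementation:

```python
import numpy as np

# Sketch of a MADDPG-style centralized critic target: the critic sees the
# observations and actions of ALL agents; the target value uses target
# policies and a target critic. Illustrative linear stand-in networks.

n_agents, obs_dim, act_dim = 2, 3, 1
rng = np.random.default_rng(0)

W = rng.normal(size=n_agents * (obs_dim + act_dim))   # critic parameters
W_target = W.copy()                                   # target critic

def critic(obs_all, act_all, w):
    x = np.concatenate(obs_all + act_all)  # joint input: all obs + all actions
    return float(w @ x)

def target_policy(obs_j):
    return np.array([np.tanh(obs_j.sum())])  # stand-in deterministic policy

gamma = 0.95
obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
acts = [rng.normal(size=act_dim) for _ in range(n_agents)]
r_i = 1.0
obs_next = [rng.normal(size=obs_dim) for _ in range(n_agents)]

# y = r_i + gamma * Q_target(o', abar_1, ..., abar_N) with abar_j = target_policy(o'_j)
acts_next = [target_policy(o) for o in obs_next]
y = r_i + gamma * critic(obs_next, acts_next, W_target)
td_error = y - critic(obs, acts, W)
# one gradient step on the squared TD error w.r.t. the critic parameters
x = np.concatenate(obs + acts)
W += 0.01 * td_error * x
```

Because the target is built from the joint next observation-action, the quantity the critic regresses toward does not drift when the other agents' policies change, which is the stationarity argument made above.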
The experimental analysis on the predator-prey game and an energy storage system control problem shows that the standard deviation of the obtained Q-values is lower in MADDPG-GCPN than in MADDPG. As another extension of MADDPG, Chu and Ye (2017) consider multi-agent cooperative problems with N agents and propose three actor-critic algorithms based on MADDPG. The first one assumes that all agents know the global reward and shares the weights between agents, so it effectively includes one actor and one critic network. The second algorithm assumes the global reward is not shared, and each agent updates its own critic using the local reward, so there are N critic networks; however, the agents share their actor weights, so there is only one actor network, which means N + 1 networks are trained in total. The third algorithm also assumes a non-shared global reward, but uses only two networks, one actor network and one critic network, where the critic has N heads and head i ∈ {1, ..., N} provides the Q-value of agent i. They compare the results of their algorithms on three new games with MADDPG, PPO (Schulman et al. 2017), and PS-TRPO (PS-TRPO is the TRPO algorithm (Schulman et al. 2015) with parameter sharing; see Sukthankar and Rodriguez-Aguilar (2017) for more details). Mao et al. (2019) present another MADDPG-based algorithm for a cooperative game, called ATT-MADDPG, which considers the same setting as Lowe et al. (2017). It enhances the MADDPG algorithm by adding an attention layer in the critic network. In the algorithm, each agent trains a critic which accesses the actions and observations of all agents. To obtain the Q-value, an attention layer is added on top of the critic model to determine the corresponding Q-value. In this way, at agent i, instead of just using [o_1^t, . . .
, o_N^t] and [a_i^t, a_{−i}^t] for time step t, ATT-MADDPG considers K combinations of the possible action-vector a_{−i}^t and obtains the corresponding K Q-values. Also, using an attention model, it obtains the weights of the K action-sets, such that the hidden vector h_i^t of the attention model is generated from the actions of the other agents (a_{−i}^t). Then, the attention weights are used to obtain the final Q-value as a weighted sum of the K possible Q-values. Indeed, their algorithm combines MADDPG with the k-head Q-value (Van Seijen et al. 2017). They provide numerical experiments on cooperative navigation, predator-prey, and a packet-routing problem, and compare the performance with MADDPG and a few other algorithms. Moreover, the effect of small or large K is analyzed for each environment. In Wang et al. (2019), the problem setting is again that of MADDPG, though they assume a limit on the communication bandwidth. Due to this limitation, the agents are not able to share all the information; e.g., they cannot share their locations in the "arrival task" (from the Multi-Agent Particle Environment (Mordatch and Abbeel 2018a)), which limits the ability of the MADDPG algorithm to solve the problem. To address this issue, they propose R-MADDPG, in which a recurrent neural network is used to remember the last communication in both the actor and the critic. To this end, they modify the experience replay such that each tuple includes (o_i^t, a_i^t, o_i^{t+1}, r_i^t, h_i^t, h_i^{t+1}), in which h_i^t is the hidden state of the actor network. The results are compared with MADDPG on the "arrival task" with a communication limit, where each agent can choose whether to send a message or not, and the message is simply the position of the agent. With a fully observable state, their algorithm works as well as MADDPG.
On the other hand, in the partially observable environment, a recurrent actor (with a fully connected critic) does not provide any better results than MADDPG; however, with both a recurrent actor and a recurrent critic, R-MADDPG obtains higher rewards than MADDPG. Since MADDPG concatenates all the local observations in the critic, it faces the curse of dimensionality as the number of agents increases. To address this issue, Iqbal and Sha (2019) proposed the Multi-Actor-Attention-Critic (MAAC) algorithm, which scales up efficiently with the number of agents. The main idea in this work is to use an attention mechanism (Choi et al. 2017, Jiang and Lu 2018) to select the relevant information for each agent during training. In particular, agent i receives the observations, o = (o_1, ..., o_N), and actions, a = (a_1, ..., a_N), of all agents. The value function Q_i^ψ(o, a), parameterized by ψ, is defined as a function of agent i's observation-action as well as the information received from the other agents:

\[ Q_i^{\psi}(o, a) = f_i(g_i(o_i, a_i), x_i), \]

where f_i is a two-layer perceptron, g_i is a one-layer embedding function, and x_i is the contribution of the other agents. In order to fix the size of x_i, it is set equal to the weighted sum of the other agents' embedded observation-actions:

\[ x_i = \sum_{j \ne i} \alpha_j v_j = \sum_{j \ne i} \alpha_j h(V g_j(o_j, a_j)), \]

where v_j is a function of the embedding of agent j, encoded with an embedding function and then linearly transformed by a shared matrix V, and h is the activation function. Denoting e_j = g_j(o_j, a_j) and using the query-key system (Vaswani et al. 2017), the attention weight α_j is proportional to:

\[ \alpha_j \propto \exp\left( e_j^{\top} W_k^{\top} W_q\, e_i \right), \]

where W_q transforms e_i into a "query" and W_k transforms e_j into a "key".
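The query-key aggregation just described can be sketched in a few lines; the dimensions, matrices, and embeddings below are all invented for illustration:

```python
import numpy as np

# Sketch of the query-key attention that aggregates the other agents'
# embeddings into x_i. All dimensions and values are illustrative.

rng = np.random.default_rng(0)
d = 4                                  # embedding size
W_q = rng.normal(size=(d, d))          # "query" transform
W_k = rng.normal(size=(d, d))          # "key" transform
V = rng.normal(size=(d, d))            # shared value transform

e = [rng.normal(size=d) for _ in range(3)]  # e_j = g_j(o_j, a_j) for 3 agents
i = 0                                       # attending agent
others = [j for j in range(3) if j != i]

# alpha_j proportional to exp(e_j^T W_k^T W_q e_i), normalized over j != i
scores = np.array([e[j] @ W_k.T @ W_q @ e[i] for j in others])
scores -= scores.max()                      # numerical stability
alpha = np.exp(scores) / np.exp(scores).sum()

# x_i: attention-weighted sum of the transformed embeddings of the others
x_i = sum(a * np.tanh(V @ e[j]) for a, j in zip(alpha, others))
```

Because x_i has a fixed size regardless of how many agents contribute to the sum, the critic's input dimension no longer grows with N, which is the scalability point made above.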
The critic step updates ψ by minimizing the following loss function:

\[ L_Q(\psi) = \sum_{i=1}^{N} \mathbb{E}_{(o,a,r,o') \sim D} \left[ \left( Q_i^{\psi}(o, a) - y_i \right)^2 \right], \quad y_i = r_i + \gamma\, \mathbb{E}_{a' \sim \pi_{\bar{\theta}}(o')} \left[ Q_i^{\bar{\psi}}(o', a') - \alpha \log \pi_{\bar{\theta}_i}(a'_i \mid o'_i) \right], \]

where ψ̄ and θ̄ are the parameters of the target critics and target policies, respectively. To encourage exploration, they also use the idea of the soft actor-critic (SAC) (Haarnoja et al. 2018). MAAC is compared with COMA (Foerster et al. 2018) (discussed shortly), MADDPG, their updated versions with SAC, and an independent learner with DDPG (Lillicrap et al. 2016), over two environments: treasure collection and rover-tower. MAAC obtains better results than the other algorithms, with the performance gap becoming smaller as the number of agents increases. Jiang et al. (2020) assume a graph connectivity among the agents such that each node is an agent. A partially observable environment is assumed in which each agent observes a local observation, takes a local action, and receives a local reward, while the agents can share their observations with their neighbors, and the weights of all agents are shared. A graph convolutional reinforcement learning approach for cooperative multi-agent settings is proposed. The multi-agent system is modeled as a graph in which the agents are the nodes, each with features given by its encoded local observations. A multi-head attention model is used as the convolution kernel to obtain the connection weights to the neighboring nodes. To learn the Q-function, an end-to-end algorithm named DGN is proposed, which uses the centralized-training distributed-execution (CTDE) approach. The goal is to maximize the sum of the rewards of all agents. During training, DGN allows the gradients of one agent to flow to its K neighboring agents, enlarging its receptive field to stimulate cooperation.
In particular, DGN consists of three phases: (i) an observation encoder, (ii) a convolutional layer, and (iii) a Q-network. The encoder (a simple MLP, or a convolutional layer if it deals with images) receives the observation o_i^t of agent i at time t and encodes it into a feature vector h_i^t. In phase (ii), the convolutional layer integrates the local observations of the K neighbors to generate a latent feature h′_i^t. In order to obtain the latent vector, an attention model is used to make the input independent of the number of input features. The attention model takes all feature vectors of the K neighbors, generates the attention weights, and then calculates the weighted sum of the feature vectors to obtain h′_i^t. Another convolutional layer may be added to the model to increase the receptive field of each agent, such that the h′_i^t are the inputs of that layer and h′′_i^t are its outputs. Finally, in phase (iii), the Q-network provides the Q-value of each possible action. Based on the idea of DenseNet (Huang et al. 2017), the Q-network gathers the observations and all the latent features and concatenates them as its input. (Note that this procedure is followed for each agent, and the weights of the network are shared among all agents.) In the loss function, besides the Q-value loss, a penalized KL-divergence term is added. This term measures the change between the current attention weights and the next-state attention weights, discouraging drastic changes in the attention weights. Using the trained network, at execution time each agent i observes o_i^t plus the observations of its K neighbors ({o_j^t}_{j∈N(i)}) to select an action. The results of their algorithm are compared with DQN, CommNet (Sukhbaatar et al. 2016), and Mean Field Q-learning (Yang et al. 2018a) on the Jungle, Battle, and Routing environments. Yang et al.
(2018a) consider multi-agent RL problems in which a huge number of agents collaborate or compete with each other to optimize a specific long-term cumulative discounted reward. They propose the Mean Field Reinforcement Learning framework, in which every single agent only considers an average effect of its neighborhood, instead of exhaustive communication with all other agents within the population. Two algorithms, namely Mean Field Q-learning (MF-Q) and Mean Field Actor-Critic (MF-AC), are developed following the mean-field idea. There exists a single state visible to all agents, and the local reward and local action of every agent are also visible to the others during training. Applying Taylor's theorem, it is proved that Q_i(s, a) can be approximated by Q_i(s, a_i, \bar{a}_i), where a concatenates the actions of all agents, a_i is the action of agent i, and \bar{a}_i denotes the average action of its neighbors. Furthermore, utilizing the contraction mapping technique, it is shown that the mean-field Q-values converge to the Nash Q-values under some particular assumptions. The proposed algorithms are tested on three different problems: Gaussian Squeeze and the Ising model (a framework in statistical mechanics to mathematically model ferromagnetism), which are cooperative problems, and the battle game, which is a mixed cooperative-competitive game. The numerical results show the effectiveness of the proposed method for many-agent RL problems.

Foerster et al. (2018) propose COMA, a model with a single centralized critic which uses the global state, the vector of all actions, and a joint reward. This critic is shared among all agents, while the actor is trained locally for each agent with the local observation-action history. The joint reward is used to train Q(s^t, [a_i^t, a_{-i}^t]).
Then, for agent i, with a_{-i}^t fixed, the actor uses a counterfactual baseline

b(s^t, a_{-i}^t) = \sum_{\hat{a}_i^t} \pi_i(\hat{a}_i^t \mid o_i^t) \, Q(s^t, [\hat{a}_i^t, a_{-i}^t]),

to obtain the contribution of action a_i^t via the advantage function A(s^t, a_i^t) = Q(s^t, [a_i^t, a_{-i}^t]) - b(s^t, a_{-i}^t). Also, each actor shares its weights with the other agents and uses a gated recurrent unit (GRU) (Cho et al. 2014) to utilize the observation history. They present the results of their algorithm on the StarCraft game (Samvelyan et al. 2019) and compare COMA with central-V, central-QV, and two implementations of independent actor-critic (IAC).

Yang et al. (2020) consider a multi-agent cooperative problem in which, in addition to the cooperative goal, each agent needs to attain some personal goals. Each agent observes a local observation, a local reward corresponding to its goal, and its own history of actions, while the actions are executed jointly in the environment. This is a common situation in problems like autonomous driving: each car has to reach a given destination, while all cars need to avoid accidents and cooperate at intersections. The authors propose a centralized-training, decentralized-execution algorithm called CM3, with two phases. In the first phase, one single network is trained for all agents to learn the personal goals. The output of this network, a hidden layer, is passed to a given layer of the second network to initialize it, and the goal of the second phase is to attain the global goal. Also, since the collective optimal solution of all agents is not necessarily optimal for every individual agent, a credit assignment approach is proposed to obtain the global solution of all agents. This credit assignment, motivated by Foerster et al. (2018), is embedded in the design of the second phase.
In the first phase of the algorithm, each agent is trained with an actor-critic algorithm as a single-agent problem, learning to achieve its given goal. To this end, the agent is trained thoroughly with some randomly assigned goal, i.e., agent i wants to learn the local policy \pi that maximizes J_i(\pi) for any arbitrarily given goal. All agents share the policy weights, so the model reduces to maximizing J_{local}(\pi) = \sum_{i=1}^{N} J_i(\pi). Using the advantage approximation, the update is performed by

\nabla_\theta J_{local}(\pi) = \mathbb{E} \left[ \sum_{i=1}^{N} \nabla_\theta \log \pi(a_i \mid o_i, g_i) \left( R(s, a_i, g_i) + \gamma V(o_i^{t+1}, g_i) - V(o_i^t, g_i) \right) \right],

in which g_i is the goal of agent i. The second phase starts from the pre-trained agents and trains a new global network in the multi-agent setting to achieve the cooperative goal through comprehensive exploration. The cooperative goal is the sum of local rewards, i.e., J_{global}(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_g^t \right], in which R_g^t = \sum_{i=1}^{N} r(o_i^t, a_i^t, g_i). In the first phase, each agent only observes the part of the state required to complete its personal task. In the second phase, additional observations are given to each agent to achieve the cooperative goal. This new information is used to train the centralized critic and also enters the advantage function used to update the actor's policy. The advantage function uses the counterfactual baseline (Foerster et al. 2018), so the global objective is updated by

\nabla_\theta J_{global}(\pi) = \mathbb{E} \left[ \sum_{i=1}^{N} \nabla_\theta \log \pi(a_i \mid o_i, g_i) \left( Q(s, a, g) - b(s, a_{-i}, g) \right) \right].

Finally, a combined version of the local and global models is used in this phase to train the model with the centralized critic. They present experiments on an autonomous vehicle negotiation problem and compare the results with COMA (Foerster et al. 2018) and an independent actor-critic learner model.
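The counterfactual baseline used by both COMA and CM3 above admits a very small sketch for discrete actions. In this illustration (names are ours), q_values[k] holds Q(s, [k, a_{-i}]) for each of agent i's candidate actions k, with the other agents' actions held fixed:

```python
import numpy as np

def counterfactual_advantage(q_values, policy_i, a_i):
    """COMA-style advantage for agent i with a_{-i} fixed.
    q_values: (K,) array, Q(s, [k, a_{-i}]) for each own-action k.
    policy_i: (K,) array, pi_i(k | o_i).
    The baseline marginalizes agent i's own action out under its policy,
    so A answers: how much better was a_i than agent i's average choice?"""
    baseline = float(np.dot(policy_i, q_values))  # b(s, a_{-i})
    return float(q_values[a_i]) - baseline
```

By construction, the policy-weighted average of this advantage over agent i's actions is zero, which is what makes it a variance-reducing baseline rather than a bias.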
Sartoretti et al. (2019b) extend the A3C algorithm (Mnih et al. 2016) to a centralized actor and a centralized critic. The proposed algorithm is based on centralized training and decentralized execution in a fully observable environment. They consider a construction problem, TERMES (Petersen 2012), in which each agent is responsible for gathering, carrying, and placing blocks to build a certain structure. Each agent observes the global state plus its own location, takes its own local action, and executes it locally (no joint action selection/execution). Each agent receives a local sparse reward: +1 if it puts down a block in a correct position, -1 if it picks up a block from a correct position, and 0 for any other action. During training, all agents share the weights (as is done in A3C) of both the actor and critic models to train a central agent asynchronously. At execution time, each agent uses one copy of the learned policy without communicating with the other agents. The goal of all agents is to achieve the maximum common reward. The learned policy can be executed with an arbitrary number of agents, and each agent sees the other agents as moving elements, i.e., as part of the environment.

Finally, in Kim et al. (2019) a multi-agent problem is considered under the following two assumptions: (1) the communication bandwidth among the agents is limited, and (2) there exists a shared communication medium such that at each time step only a subset of agents is able to use it to broadcast their messages to the other agents. Therefore, communication scheduling is required to determine which agents are able to broadcast their messages. Utilizing the proposed framework, called SchedNet, the agents are able to schedule themselves, learn how to encode the received messages, and also learn how to pick actions based on these encoded messages.
SchedNet follows centralized training and decentralized execution. Therefore, in training the global state is available to the critic, while the actor is local to each agent and the agents are able to communicate through the limited channel. To control the communication, a medium access control (MAC) protocol is proposed, which uses a weight-based scheduler (WSA) to determine which nodes can access the shared medium. The local actor i contains three networks: (1) a message encoder, which takes the local observation o_i and outputs the messages m_i; (2) a weight generator, which takes the local observation o_i and outputs the weight w_i, which determines the importance of the observations at node i; and (3) an action selector, which receives the observation o_i, the encoded messages m_i, and the information from the scheduling module (which selects the K agents allowed to broadcast their messages), and maps this information to the action a_i. A centralized critic is used during training to criticize the actor. In particular, the critic receives as input the global state of the environment and the weight vector W generated by the weight generator networks, and outputs both Q(S, W) and V(S). The former is used to update the actor weight w_i, while the latter is used to adjust the weights of the two other networks, namely the action selector and the message encoder. Experimental results on predator-prey, cooperative-communication, and navigation tasks demonstrate that intelligent communication scheduling can be helpful in MARL.

6 Value Function Factorization

Consider a cooperative multi-agent problem in which we are allowed to share all information among the agents and there is no communication limitation among them. Further, let us assume that we are able to deal with the huge action space.
In this scenario, a centralized RL approach can be used to solve the problem, i.e., all state observations are merged together and the problem is reduced to a single-agent problem with a combinatorial action space. However, Sunehag et al. (2018) show that naive centralized RL methods fail to find the global optimum, even if we are able to solve problems with such a huge state and action space. The issue comes from the fact that some of the agents may get lazy and not learn and cooperate as they are supposed to, which may lead to the failure of the whole system. One possible approach to address this issue is to determine the role of each agent in the joint reward and then somehow isolate its share of it. This category of algorithms is called Value Function Factorization.

In POMDP settings, if the optimal reward shaping is available, the problem reduces to training several independent learners, which simplifies the learning. Therefore, having a reward-shaping model would be appealing for any cooperative MARL. However, in practice it is not easy to divide the received reward among the agents, since their contributions to the reward are not known or are hard to measure. Following this idea, the rest of this section discusses the corresponding algorithms.

In the literature on tabular RL, there are two common approaches to reward shaping: (i) difference rewards (Agogino and Tumer 2004), which tries to isolate the reward of each agent from the joint reward, i.e., \hat{r}_i = r - r_{-i}, where r_{-i} denotes the other agents' share of the global reward; and (ii) potential-based reward shaping (Ng et al. 1999). In this class of value function factorization methods, the term r + \gamma \Phi(s') - \Phi(s) is used instead of the mere r, in which \Phi(s) quantifies the desirability of being at state s. This approach has also been extended to online POMDP settings (Eck et al.
2016) and to multi-agent settings (Devlin and Kudenko 2011), though defining the potential function is challenging and usually needs specific domain knowledge. In order to address this issue, Devlin et al. (2014) combine these approaches and propose two tabular algorithms. Following this idea, several value function factorization algorithms have been proposed to automate reward shaping and avoid the need for a field expert; they are summarized in the following.

Sunehag et al. (2018) consider a fully cooperative multi-agent problem (so a single shared reward exists) in which each agent observes its own state and action history. An algorithm, called VDN, is proposed to decompose the value function across the agents. Intuitively, VDN measures the impact of each agent on the observed joint reward. It is assumed that the joint action-value function Q_{tot} can be additively decomposed into N Q-functions for the N agents, in which each Q-function relies only on the local state-action history, i.e.,

Q_{tot} = \sum_{i=1}^{N} Q_i(\tau_i, a_i; \theta_i).    (20)

In other words, for the joint observation-action history \tau, it assumes the validity of the individual-global-max (IGM) condition. Individual action-value functions [Q_i : \mathcal{T} \times \mathcal{A}]_{i=1}^{N} satisfy the IGM condition for the joint action-value function Q_{tot} if

\arg\max_{a} Q_{tot}(\tau, a) = \left( \arg\max_{a_1} Q_1(\tau_1, a_1), \; \ldots, \; \arg\max_{a_N} Q_N(\tau_N, a_N) \right),    (21)

in which \tau is the vector of the local observations of all agents and a is the vector of the actions of all agents. Therefore, each agent observes its local state, obtains the Q-values for its actions, and selects an action; the sum of the Q-values of the selected actions of all agents then provides the total Q-value of the problem. Using the shared reward and the total Q-value, the loss is calculated and the gradients are backpropagated into the networks of all agents.
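The appeal of the additive decomposition (20) is that the joint argmax decouples into per-agent argmaxes, which is exactly the IGM condition (21). A minimal sketch with tabular per-agent Q-values (names are ours):

```python
import numpy as np
from itertools import product

def vdn_qtot(local_qs, actions):
    """Q_tot as the sum of each agent's Q-value for its selected action."""
    return sum(float(q[a]) for q, a in zip(local_qs, actions))

def decentralized_argmax(local_qs):
    """Under VDN's additive decomposition, each agent greedily maximizing
    its own Q_i jointly maximizes Q_tot -- no joint search needed."""
    return [int(np.argmax(q)) for q in local_qs]

def centralized_argmax(local_qs):
    """Brute-force joint argmax over the combinatorial action space,
    shown only to verify it agrees with the decentralized version."""
    joint = max(product(*[range(len(q)) for q in local_qs]),
                key=lambda acts: vdn_qtot(local_qs, acts))
    return list(joint)
```

For additively decomposed Q-values the two argmax routines always agree, while the decentralized one costs O(N·|A|) instead of O(|A|^N).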
In the numerical experiments, a recurrent neural network with a dueling architecture (Wang et al. 2016c) is used to train the model. Also, two extensions of the model are analyzed: (i) sharing the policy among the agents by adding a one-hot encoding of the agent id to the state input, and (ii) adding information channels to share some information among the agents. Finally, VDN is compared with independent learners and centralized training on three versions of a two-player 2D grid.

QMIX (Rashid et al. 2018) considers the same problem as VDN and proposes an algorithm which is in fact an improvement over VDN (Sunehag et al. 2018). As mentioned, VDN adds restrictions to obtain additivity of the Q-value and shares the action-value function during training. QMIX also shares the action-value function during training (a centralized-training, decentralized-execution algorithm); however, it adds the following constraint to the problem:

\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i,    (22)

which is enforced through positive weights in the mixer network and, as a result, guarantees (approximately) monotonic improvement. In particular, in this model each agent has a Q_i network, and these are part of the general network (Q_{tot}) that provides the Q-value of the whole game. Each Q_i has the same structure as DRQN (Hausknecht and Stone 2015), so it is trained using the same loss function as DQN. Besides the monotonicity constraint on the relationship between Q_{tot} and each Q_i, QMIX adds some extra information from the global state plus a non-linearity into Q_{tot} to improve the solution quality. They provide numerical results on StarCraft II and compare the solution with VDN.

Even though VDN and QMIX cover a large domain of multi-agent problems, the assumptions of these two methods do not hold for all problems. To address this issue, Son et al. (2019) propose the QTRAN algorithm.
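Constraint (22) is easy to enforce mechanically: pass the mixer weights through an absolute value so every path from Q_i to Q_tot has non-negative slope. The sketch below is ours and deliberately simplified; in QMIX the non-negative weights are produced by hypernetworks conditioned on the global state, whereas here they are fixed constants:

```python
import numpy as np

def monotonic_mix(local_qs, w1, b1, w2, b2):
    """A minimal QMIX-style mixer: abs() on the weights guarantees
    dQ_tot/dQ_i >= 0, so Q_tot is monotone in every agent's Q_i.
    local_qs: (N,) per-agent Q-values for the chosen actions."""
    h = np.maximum(np.abs(w1) @ local_qs + b1, 0.0)  # hidden layer (ReLU here)
    return float(np.abs(w2) @ h + b2)                # scalar Q_tot
```

Because of the monotonicity, each agent's greedy action with respect to its own Q_i is still greedy with respect to Q_tot, so decentralized execution remains valid even though the mixing is non-linear.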
The general setting is the same as in VDN and QMIX (i.e., general DEC-POMDP problems in which each agent has its own partial observation and action history, and all agents share a joint reward). The key idea here is that the actual Q_{tot} may be different from \sum_{i=1}^{N} Q_i(\tau_i, a_i; \theta_i). However, they consider an alternative joint action-value function Q'_{tot}, assumed to be factorizable by additive decomposition. Then, to fill the possible gap between Q_{tot} and Q'_{tot}, they introduce

V_{tot} = \max_{a} Q_{tot}(\tau, a) - \sum_{i=1}^{N} Q_i(\tau_i, \bar{a}_i),    (23)

in which \bar{a}_i = \arg\max_{a'_i} Q_i(\tau_i, a'_i). Given \bar{a} = [\bar{a}_i]_{i=1}^{N}, they prove that

\sum_{i=1}^{N} Q_i(\tau_i, a_i; \theta_i) - Q_{tot}(\tau, a) + V_{tot}(\tau, a) \; \begin{cases} = 0 & a = \bar{a} \\ \geq 0 & a \neq \bar{a} \end{cases}    (24)

Based on this theory, three networks are built: the individual Q_i, Q_{tot}, and the joint regularizer V_{tot}, and three loss functions are defined to train them. The local network of each agent is just a regular value-based network with the local observation, which provides the Q-values of all possible actions and runs locally at execution time. Both the Q_{tot} and regularizer networks use hidden features from the individual value-based networks to help sample efficiency. In the experimental analysis, QTRAN is compared with VDN and QMIX on Multi-domain Gaussian Squeeze (HolmesParker et al. 2014) and a modified predator-prey task (Stone and Veloso 2000).

Within the cooperative setting, in the absence of a joint reward, the reward-shaping idea can be applied too. Specifically, assume that at time step t agent i observes its own local reward r_i^t. In this setting, Mguni et al. (2018) consider a multi-agent problem in which each agent observes the full state and takes its local action based on a stochastic policy.
A general reward-shaping algorithm for the multi-agent problem is discussed, and a proof of obtaining the Nash equilibrium is provided. In particular, a meta-agent (MA) is introduced to modify the agents' reward functions so as to obtain convergence to an efficient Nash equilibrium solution. The MA initially does not know the parametric reward modifier and learns it through training. Specifically, the MA wants to find the optimal variables w to reshape the reward function of each agent, though it only observes the reward corresponding to the chosen w. With a given w, the MARL algorithm can converge while the agents do not know anything about the MA function; the agents only observe the reward assigned by the MA and use it to optimize their own policies. Once all agents execute their actions and receive the reward, the MA receives the feedback and updates the weights w. Training the MA with a gradient-based algorithm is quite expensive, so in the numerical experiments a Bayesian optimization with an expected-improvement acquisition function is used. To train the agents, an actor-critic algorithm is used with a two-layer neural network as the value network. The value network shares its parameters with the actor network, and an A2C algorithm (Mnih et al. 2016) is used to train the actor. It is proved that, under a given condition, a reward-modifier function exists that maximizes the expectation of the reward-modifier function. In other words, a Markov-Nash equilibrium (M-NE) exists in which each agent follows a policy that provides the highest possible value for that agent. Then, convergence to the optimal solution is proved under certain conditions. To demonstrate the performance of their algorithm, a problem with 2000 agents is considered in which the desired locations of the agents change through time.
7 Consensus

The idea of the centralized critic, discussed in Section 5, works well when there is a small number of agents in the communication network. However, as the number of agents increases, the volume of information might overwhelm the capacity of a single unit. Moreover, in sensor-network applications, in which the information is observed across a large number of scattered centers, collecting all this local information at a centralized unit is often a formidable task under limitations such as energy constraints, privacy constraints, geographical limitations, and hardware failures. One idea to deal with this problem is to remove the central unit and allow the agents to communicate through a sparse network, sharing information with only a subset of agents, with the goal of reaching a consensus over a variable with these agents (called neighbors). Besides the numerous real-world applications of this setting, this is quite a fair assumption in MARL applications (Zhang et al. 2018c, Jiang et al. 2020). By limiting the number of neighbors to communicate with, the amount of communication remains linear in the number of neighbors. In this way, each agent uses only its local observations, though it uses some shared information from the neighbors to stay tuned with the network. Further, applying the consensus idea, several works prove the convergence of the proposed algorithms when linear approximators are utilized. In the following, we review some of the leading and most recent papers in this area.

Varshavskaya et al. (2009) study a problem in which each agent has a local observation, executes its local policy, and receives a local reward. A tabular policy-optimization agreement algorithm is proposed, which uses Boltzmann's law (similar to the soft-max function) to solve this problem.
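For reference, the Boltzmann (soft-max) action distribution mentioned above can be sketched as follows (a generic sketch with names of our own choosing, not the authors' implementation); the temperature controls how greedy the resulting policy is:

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Boltzmann (soft-max) distribution over actions given tabular
    Q-values: p(a) proportional to exp(Q(a) / temperature)."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()          # shift for numerical stability; cancels in the ratio
    p = np.exp(z)
    return p / p.sum()
```

High temperatures flatten the distribution toward uniform exploration; low temperatures approach the greedy argmax policy.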
The agreement (consensus) algorithm assumes that an agent can send its local reward, a counter on the observation, and the action taken per observation to its neighbors. The goal of the algorithm is to maximize the weighted average of the local rewards. In this way, they guarantee that each agent learns as much as a central learner could, and therefore converges to a local optimum.

In Kar et al. (2013b,a) the authors propose a decentralized multi-agent version of the tabular Q-learning algorithm called QD-learning. In this paper, the global reward is expressed as the sum of the local rewards, though each agent is only aware of its own local reward. In the problem setup, the authors assume that the agents communicate through a time-invariant, undirected, weakly connected network to share their observations with their direct neighbors. All agents observe the global state and the global action, and the goal is to optimize the network-averaged infinite-horizon discounted reward. QD-learning works as follows. Assume that we have N agents in the network and agent i can communicate with its neighbors \mathcal{N}_i. This agent stores Q_i(s, a) values for all possible state-action pairs. Each update of agent i at time t includes the regular Q-learning step plus the deviation of the Q-value from its neighbors' values, as below:

Q_i^{t+1}(s, a) = Q_i^t(s, a) + \alpha_{s,a}^t \left[ r_i(s^t, a^t) + \gamma \min_{a' \in \mathcal{A}} Q_i^t(s^{t+1}, a') - Q_i^t(s, a) \right] - \beta_{s,a}^t \sum_{j \in \mathcal{N}_i^t} \left( Q_i^t(s, a) - Q_j^t(s, a) \right),    (25)

where \mathcal{A} is the set of all possible actions, and \alpha_{s,a}^t and \beta_{s,a}^t are the step-sizes of the QD-learning algorithm. It is proved that this method converges asymptotically to the optimal Q-values under some specific conditions on the step-sizes.
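Update (25) combines a standard temporal-difference innovation with a consensus penalty that pulls each agent's table toward its neighbors'. A minimal tabular sketch (names are ours; the min over next actions follows the source's cost-style formulation):

```python
import numpy as np

def qd_update(Q_i, Q_neighbors, s, a, r, s_next, alpha, beta, gamma):
    """One QD-learning step for agent i at state-action (s, a):
    TD innovation (weighted by alpha) minus a consensus term (weighted
    by beta) summing the gaps to each neighbor's estimate."""
    innovation = r + gamma * Q_i[s_next].min() - Q_i[s, a]
    consensus = sum(Q_i[s, a] - Q_j[s, a] for Q_j in Q_neighbors)
    Q_new = Q_i.copy()
    Q_new[s, a] += alpha * innovation - beta * consensus
    return Q_new
```

When an agent's estimate lags behind its neighbors', the consensus term is negative and therefore pushes the estimate up, spreading reward information through the network even though each agent only sees its own local reward.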
In other words, under the given conditions, they prove that their algorithm obtains the result that the agents could achieve if the problem were solved centrally.

In Pennesi and Paschalidis (2010) a distributed actor-critic (D-AC) algorithm is proposed under the assumption that the states, actions, and rewards are local to each agent; however, each agent's action does not change the other agents' transition models. The critic step is performed locally, meaning that each agent evaluates its own policy using the local reward it receives from the environment. In particular, the state-action value function is parameterized using a linear function, and the parameters are updated locally at each agent using the temporal-difference algorithm together with eligibility traces. The actor step, on the other hand, is conducted using information exchange among the agents. First, the gradient of the average reward is calculated. Then a gradient step is performed to improve the local copy of the policy parameters, along with a consensus step. A convergence analysis is provided under diminishing step-sizes, showing that the gradient of the average reward function tends to zero for every agent as the number of iterations goes to infinity. In this paper, a sensor-network problem with multiple mobile nodes is considered for testing the proposed algorithm. In particular, there are M target points and N mobile sensor nodes, and whenever one node visits a target point a reward is collected. The ultimate goal is to train the moving nodes in the grid such that the long-term cumulative discounted reward is maximized. They consider a 20 × 20 grid with three target points and sixteen agents. The numerical results show that the reward improves over time while the policy parameters reach consensus. Macua et al.
(2018) propose a new algorithm, called Diffusion-based Distributed Actor-Critic (Diff-DAC), for single- and multi-task multi-agent RL. In this setting, there are N agents in the network such that there is one path between any two agents, each is assigned either the same task as the others or a different one, and the goal is to maximize the weighted sum of the value functions over all tasks. Each agent runs its own instance of the environment with a specific task, without intervening with the other agents. For example, each agent runs a given cart-pole problem where the pole length and its mass differ across agents. Basically, one agent does not need any information, like the state, action, or reward, of the other agents. Agent i learns the policy with parameters \theta_i, while it tries to reach consensus with its neighbors using a diffusion strategy. In particular, Diff-DAC trains multiple agents in parallel, with different and/or similar tasks, to reach a single policy that performs well on average over all tasks, meaning that the single policy might obtain a high reward on some tasks but perform poorly on others. The problem formulation is based on the average reward, average value function, and average transition probability. Based on that, they provide a linear programming formulation of the tabular problem, along with its Lagrangian relaxation and the duality condition for having a saddle point. A dual-ascent approach is used to find a saddle point, in which (i) a primal solution is found for a given dual variable by solving an LP problem, and then (ii) a gradient ascent is performed in the direction of the dual variables. These steps are performed iteratively to obtain the optimal solution. Next, the authors propose a practical algorithm utilizing a DNN function approximator.
During the training of this algorithm, agent i first performs an update over the weights of the critic as well as the actor network using local information. Then, a weighted average is taken over the weights of both networks, which ensures that these networks reach consensus. The algorithm is compared with centralized training on the Cart-Pole game.

In Zhang et al. (2018c) a multi-agent problem is considered with the following setup. There exists a common environment for all agents, the global state s is available to all of them, each agent takes its own local action a_i, and the global action a = [a_1, a_2, \cdots, a_N] is available to all N agents; each agent receives reward r_i after taking an action, and this reward is visible only to agent i. In this setup, the agents can perform time-varying random communication with their direct neighbors (\mathcal{N}_i) to share some information. Two AC-based algorithms are proposed to solve this problem. In the first algorithm, each agent has its own local approximation of the Q-function with weights w_i, though a fair approximation needs the global reward r^t (not the local rewards r_i^t, \forall i = 1, \ldots, N). To address this issue, it is assumed that each agent shares the parameters w_i with its neighbors, and in this way a consensual estimate of Q_w can be achieved. To update the critic, the temporal difference is estimated by \delta_i^t = r_i^{t+1} - \mu_i^t + Q(s^{t+1}, a^{t+1}; w_i^t) - Q(s^t, a^t; w_i^t), in which

\mu_i^{t+1} = (1 - \beta_{w,t}) \mu_i^t + \beta_{w,t} r_i^{t+1},    (26)

i.e., the moving average of agent i's rewards with parameter \beta_{w,t}, and the new local weights \tilde{w}_i are obtained locally. To achieve consensus, a weighted sum (with weights coming from the consensus matrix) of the parameters of the neighbors' critics is calculated as below:

w_i^{t+1} = \sum_{j \in \mathcal{N}_i} c_{ij} \tilde{w}_j^t.    (27)

This weighted sum provides the new weights of critic i for the next time step.
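The consensus step (27) is just a convex combination of neighbors' parameter vectors. A minimal sketch (names are ours), where C is a row-stochastic consensus matrix whose zero entries encode missing network links:

```python
import numpy as np

def consensus_step(local_weights, C, i):
    """Consensus update for agent i's critic parameters:
    w_i <- sum_j c_ij * w_j, with C row-stochastic (rows sum to 1)
    and c_ij = 0 whenever j is not a neighbor of i."""
    return sum(C[i, j] * w for j, w in enumerate(local_weights))
```

Repeated applications of this step over a connected network drive all agents' parameter vectors toward a common average, which is what lets each local critic behave like an estimate trained on the network-wide reward.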
To update the actor, each agent observes the global state and the local action to update its policy; though, during training, the advantage function requires the actions of all agents, as mentioned earlier. In the critic update, the agents share neither reward information nor actor policies; so, in some sense, the agents keep their data and policies private. However, they share the actions with all agents, so the setting of the problem is quite similar to the MADDPG algorithm, although MADDPG assumes local observations in the actor. In the second algorithm, besides sharing the critic weights, the critic observes the moving-average reward estimates of the neighboring agents and uses them to obtain a consensual estimate of the reward. Therefore, this algorithm replaces update (26) with a local update \hat{\mu}_i^t = (1 - \beta_{w,t}) \tilde{\mu}_i^t + \beta_{w,t} r_i^{t+1}, followed by the consensus step \tilde{\mu}_i^{t+1} = \sum_{j \in \mathcal{N}_i} c_{ij} \hat{\mu}_j^t. Note that in the second algorithm the agents share more information with their neighbors. From the theoretical perspective, the authors provide a global convergence proof for both algorithms in the case of linear function approximation. In the numerical results, they provide results on two examples: (i) a problem with 20 agents and |S| = 20, and (ii) a modified version of cooperative navigation (Lowe et al. 2017) with 10 agents and |S| = 40, in which each agent observes the full state and a given target landmark to cover is added for each agent, so that agents try to get closer to their assigned landmark. They compare the results of the two algorithms with the case in which there is a single actor-critic model that observes the rewards of all agents and updates a centralized controller. In the first problem, their algorithms converged to the same return value that the centralized algorithm achieves.
In the second problem, a neural network, and thus a non-linear approximation, was used, and their algorithms attained a small gap compared to the solutions of the centralized version.

In Wai et al. (2018), a double-averaging scheme is proposed for the task of policy evaluation in multi-agent problems. The setting follows Zhang et al. (2018c), i.e., the state is global, the actions are visible to all agents, and the rewards are private and visible only to the local agent. In detail, first, duality theory is utilized to reformulate the multi-agent policy evaluation problem, which is supposed to minimize the mean squared projected Bellman error (MSPBE) objective, into a convex-concave optimization problem with a finite-sum structure. Then, in order to solve the problem efficiently, the authors combine dynamic consensus (Qu and Li 2017) and the SAG algorithm (Schmidt et al. 2017). Under linear function approximation, it is proved that the proposed algorithm converges linearly under some conditions.

Zhang et al. (2018b) consider the multi-agent problem with continuous state and action spaces. The rest of the setting is similar to Zhang et al. (2018c) (i.e., global state, global action, and local reward). Again, an AC-based algorithm is proposed for this problem. In general, for continuous spaces, stochastic policies lead to high variance in the gradient estimation. To deal with this issue, the deterministic policy gradient (DPG) algorithm was proposed in Silver et al. (2014), which requires off-policy exploration. However, in the setting of Zhang et al. (2018b) the off-policy information of each agent is not known to the other agents, so the approach used in DPG (Silver et al. 2014, Lillicrap et al. 2016) cannot be applied here.
Instead, a gradient update based on the expected policy gradient (EPG) (Ciosek and Whiteson 2020) is proposed, which uses a global estimate of the Q-value, approximated by the consensus update. Thus, each agent shares the parameters w_i of its Q-value estimator with its neighbors. Given these assumptions, convergence guarantees with a linear approximator are provided, and the performance is compared with a centrally trained algorithm for the same problem. Following a similar setting as Zhang et al. (2018c), Suttle et al. (2020) propose a new distributed off-policy actor-critic algorithm, in which there exists a global state visible to all agents, and each agent takes an action that is visible to the whole network and receives a local reward that is available only locally. The main difference between this work and Zhang et al. (2018c) is that the critic step is conducted in an off-policy setting using the emphatic temporal-difference ETD(λ) policy evaluation method (Sutton et al. 2016). In particular, ETD(λ) uses a state-dependent discount factor γ and a state-dependent bootstrapping parameter λ. Besides, in this method there exists an interest function f : S → R+ that takes into account the user's interest in specific states. The algorithm proceeds as follows. First, each agent performs a consensus step over the critic parameters. Since the behavior policy differs from the target policy for each agent, they apply importance sampling (Kroese and Rubinstein 2012) to re-weight the samples from the behavior policy so that they correspond to the target policy. Then, an inner loop starts to perform another consensus step over the importance sampling ratios. In the next step, a critic update using the ETD(λ) algorithm is performed locally, and the updated weights are broadcast over the network.
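The importance-sampling re-weighting used in the critic step can be illustrated with a small self-contained sketch; the discrete behavior/target policies and the reward table below are hypothetical stand-ins, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete policies over 3 actions.
behavior = np.array([0.5, 0.3, 0.2])   # policy that generated the data
target   = np.array([0.1, 0.2, 0.7])   # policy we want to evaluate
reward   = np.array([1.0, 2.0, 3.0])   # reward of each action

# Sample actions from the behavior policy.
actions = rng.choice(3, size=200_000, p=behavior)

# Re-weight each sample by the ratio rho = pi(a) / b(a) so that the
# weighted average estimates the expected reward under the target policy.
rho = target[actions] / behavior[actions]
estimate = np.mean(rho * reward[actions])

true_value = float(np.dot(target, reward))   # 0.1*1 + 0.2*2 + 0.7*3 = 2.6
```

The same ratio correction, applied per time-step along trajectories, is what lets each agent evaluate the shared target policy from its own off-policy experience.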
Finally, each agent performs the actor update using local gradient information for the actor parameters. Following the analysis provided for ETD(λ) in Yu (2015), the authors prove the convergence of the proposed distributed actor-critic method when linear function approximation is utilized. Zhang et al. (2017) propose a consensus RL algorithm in which each agent uses its local observations as well as those of its neighbors within a given directed graph. The multi-agent problem is modeled as a control problem, and a consensus error is introduced. The control policy is supposed to minimize the consensus error while stabilizing the system and attaining a finite local cost. A theoretical bound on the consensus error is provided, and the theoretical solution for obtaining the optimal policy, which indeed requires the environment dynamics, is discussed. A practical actor-critic algorithm is proposed to implement the method. The practical version involves a neural network approximator with a linear activation function. The critic measures the local cost of each agent, and the actor network approximates the control policy. The results of their algorithm on a leader-tracking communication problem are presented and compared with the known optimal solution. In Macua et al. (2015), an off-policy distributed policy evaluation algorithm is proposed. In this paper, a linear function is used to approximate the long-term cumulative discounted reward of a given policy (the target policy), which is assumed to be the same for all agents, while different agents follow different policies along the way. In particular, a distributed variant of the Gradient Temporal Difference (GTD) algorithm² (Sutton et al. 2009) is developed utilizing a primal-dual optimization scheme. In order to deal with the off-policy setting, they apply the importance sampling technique.
The state space, action space, and transition probabilities are the same for every node, but the agents' actions do not influence each other. This assumption makes the problem stationary. Therefore, the agents do not need to know the states and actions of the other agents. Regarding the reward, it is assumed that there exists only one global reward in the problem. First, they show that the GTD algorithm is a stochastic Arrow-Hurwicz³ (Arrow et al. 1958) algorithm applied to the dual of the original optimization problem. Then, inspired by Chen and Sayed (2012), they propose a diffusion-based distributed GTD algorithm. Under sufficiently small but constant step-sizes, they provide a mean-square-error performance analysis which proves that the proposed algorithms converge to a unique solution. In order to evaluate the performance of the proposed method, a 2-D grid-world problem with 15 agents is considered. Two different policies are evaluated using the distributed GTD algorithm. It is shown that the diffusion strategy helps the agents benefit from the other agents' experiences. Considering a similar setup as Macua et al. (2015), Stanković and Stanković (2016) propose two multi-agent policy evaluation algorithms over a time-varying communication network. A given policy is evaluated using samples derived from different policies in different agents (i.e., off-policy). As in Macua et al. (2015), it is assumed that the actions of the agents do not interfere with each other. Weak convergence is provided for both algorithms. Another variant of the distributed GTD algorithm was proposed in Lee et al. (2018). Each agent in the network follows a local policy π_i, and the goal is to evaluate the global long-term reward, which is the sum of the local rewards. In this work, it is assumed that each agent can observe the global joint state.
A linear function, which combines the features of the states, is used to estimate the value function. The problem is modeled as a constrained optimization problem (with a consensus constraint), and then, following the same procedure as Macua et al. (2015), a primal-dual algorithm is proposed to solve it. A rigorous convergence analysis based on the ordinary differential equation (ODE) method (Borkar and Meyn 2000) is provided for the proposed algorithm. To keep the algorithm stable, they add box constraints over the variables. Finally, under diminishing step-sizes, they prove that the distributed GTD (DGTD) converges with probability one. One of the numerical examples is a stock market problem, where N = 5 different agents have different policies for trading stocks. DGTD is utilized to estimate the average long-term discounted profit of all agents. The results are compared with a single GTD algorithm for the case in which the sum of the rewards is available. The comparison results show that each agent can successfully approximate the global value function.

² The GTD algorithm was proposed to stabilize the TD algorithm with linear function approximation in an off-policy setting.
³ Arrow-Hurwicz is a primal-dual optimization algorithm that iteratively performs gradient steps on the Lagrangian over the primal and dual variables.

Cassano et al. (2021) consider two different scenarios for the policy evaluation task: (i) each agent follows a policy (behavior policy) different from the others, and the goal is to evaluate a target policy (i.e., off-policy); in this case, each agent only has knowledge of its own state and reward, which are independent of the other agents' states and rewards; (ii) the state is global and visible to all agents, the reward is local for each agent, and the goal is to evaluate the target team policy.
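The Arrow-Hurwicz scheme mentioned in the footnote alternates a gradient descent step on the primal variable with a gradient ascent step on the dual variable of the Lagrangian. A minimal sketch on a toy equality-constrained problem (minimize x²/2 subject to x = 2, an assumed example unrelated to the papers above) is:

```python
# Arrow-Hurwicz on L(x, y) = x**2 / 2 + y * (x - 2):
# descend in the primal variable x, ascend in the dual variable y.
def arrow_hurwicz(steps=20_000, alpha=0.01):
    x, y = 0.0, 0.0
    for _ in range(steps):
        grad_x = x + y          # dL/dx
        grad_y = x - 2.0        # dL/dy (constraint residual)
        x -= alpha * grad_x
        y += alpha * grad_y
    return x, y

x, y = arrow_hurwicz()
# Saddle point: x* = 2 (feasible), y* = -2 (so that dL/dx vanishes).
```

The distributed GTD variants above follow the same pattern, with the value-function weights as the primal variable and the auxiliary GTD weights as the dual variable.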
They propose Fast Diffusion for Policy Evaluation (FDPE) for the case with a finite data set, which combines off-policy learning, eligibility traces, and linear function approximation. This algorithm can be applied to both scenarios mentioned above. The main idea is to apply a variance-reduced algorithm called AVRG (Ying et al. 2018) over a finite data set to obtain a linear convergence rate. Further, they modify the cost function to control the bias term. In particular, they use the h-stage Bellman equation to derive the H-truncated λ-weighted Mean Square Projected Bellman Error (Hλ-MSPBE), in contrast to the usual case (e.g., Macua et al. (2015)) where the Mean Square Projected Bellman Error (MSPBE) is used. It is shown that the bias term can be controlled through (H, λ). Also, they add a regularization term to the cost function, which can be useful in some cases. A distributed off-policy actor-critic algorithm is proposed in Zhang and Zavlanos (2019). In contrast to Zhang et al. (2018c), where the actor step is performed locally and the consensus update is proposed for the critic, in Zhang and Zavlanos (2019) the critic update is performed locally, and the agents asymptotically achieve consensus on the actor parameters. The state and action spaces are continuous, and each agent has a local state and action; however, the global state and the global action are visible to all agents. Both the policy function and the value function are linearly parameterized. A convergence analysis is provided for the proposed algorithm under diminishing step-sizes for both the actor and critic steps. The effectiveness of the proposed method was studied on the distributed resource allocation problem.

8 Learn to Communicate

As mentioned earlier in Section 7, some environments allow communication among agents.
The consensus algorithms use the communication bandwidth to pass raw observations, policy weights/gradients, critic weights/gradients, or some combination of them. A different approach to using the communication bandwidth is to learn a communication action (like a message) that allows agents to send the information they want. In this way, an agent can learn when to send a message, what type of message to send, and which agent to send it to. Usually, the communication actions do not interfere with the environment, i.e., the messages do not affect the next state or the reward. Kasai et al. (2008) proposed one of the first learning-to-communicate algorithms, in which tabular Q-learning agents learn messages to communicate with other agents in the predator-prey environment. The same approach with tabular RL is followed in Varshavskaya et al. (2009). Besides these early works, there are several recent papers in this area which utilize function approximators. In this section, we discuss some of the more relevant papers in this research area. In one of the most recent works, Foerster et al. (2016) consider the problem of learning how to communicate in a fully cooperative multi-agent setting (recall that in a fully cooperative environment, agents share a global reward) in which each agent accesses a local observation and has a limited bandwidth to communicate with the other agents. Suppose that M and U denote the message space and the action space, respectively. In each time-step, each agent takes an action u ∈ U, which affects the environment, and decides on an action m ∈ M, which does not affect the environment and is observed only by the other agents. The proposed algorithm follows the centralized-learning decentralized-execution paradigm, under which it is assumed that at training time the agents have no restriction on the communication bandwidth.
They propose two main approaches to solve this problem. Both approaches use DRQN (Hausknecht and Stone 2015) to address partial observability and disable experience replay to deal with non-stationarity. The input of the Q-network for agent i at time t includes o_i^t, h_i^t (the hidden state of the RNN), {u_j^{t-1}}_j, and {m_j^{t-1}}_j for all j ∈ {1, ..., N}. When parameter sharing is used, the index i is also added to the input, which helps learn specialized networks for agent i within parameter sharing. All input values are converted into vectors of the same size, either by a look-up table or by an embedding (a separate embedding for each input element), and the sum of these same-size vectors is the final input to the network. The network returns |M| + |U| outputs for selecting the actions u and m. It includes two layers of GRUs, followed by two MLP layers, and a final layer with |U| + |M| units, of which |U| represent the Q-values over the environment actions. The handling of the |M| message outputs differs between the two algorithms and is explained in the following. First, they propose the reinforced inter-agent learning (RIAL) algorithm. To select the communication action, the network includes |M| additional Q-values to select the discrete action m_i^t. They also propose a practical version of RIAL in which the agents share the policy parameters, so that RIAL only needs to learn one network. The second algorithm is differentiable inter-agent learning (DIAL), in which the message is continuous and the message receiver provides feedback, in the form of a gradient, to the message sender to minimize the DQN loss. In other words, the receiver obtains the gradient of its Q-value w.r.t. the received message and sends it back to the sender, so that the sender knows how to change the message to optimize the Q-value of the receiver.
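The core of DIAL, gradient feedback flowing from the receiver back through the message to the sender, can be sketched with a hand-derived chain rule on a linear toy model; the single-weight "networks" `w_s` and `w_r` and the fixed target are illustrative assumptions, not the paper's architecture:

```python
# Toy DIAL-style update: the sender produces a continuous message
# m = w_s * x; the receiver computes a Q-value q = w_r * m and incurs
# loss L = (q - target)**2. The receiver sends dL/dm back through the
# channel, and the sender applies the chain rule dL/dw_s = dL/dm * x.
w_s, w_r = 0.1, 0.5
x, target, lr = 1.0, 2.0, 0.1

for _ in range(200):
    m = w_s * x                 # sender's message
    q = w_r * m                 # receiver's Q-value
    dL_dq = 2.0 * (q - target)
    dL_dm = dL_dq * w_r         # gradient fed back through the channel
    w_r -= lr * dL_dq * m       # receiver's local update
    w_s -= lr * dL_dm * x       # sender's update via the received gradient

final_loss = (w_r * w_s * x - target) ** 2
```

In a deep-learning framework this feedback path is simply autograd backpropagating through the message tensor; the sketch makes the cross-agent gradient explicit.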
Intuitively, agents are rewarded for the communication actions if the receiving agent correctly interprets the message and acts upon it. The network also creates a continuous vector for the communication action, so that there is no action selector for the communication action m_i^t; instead, a regularizer unit discretizes it if necessary. They provide numerical results on the switch riddle prisoner game and on three communication games with the MNIST dataset. The results are compared with the no-communication and parameter-sharing versions of the RIAL and DIAL methods. Jorge et al. (2016) extend DIAL in three directions: (i) allowing communication of arbitrary size, (ii) gradually increasing noise on the communication channels to make sure that the agents learn a symbolic language, and (iii) agents do not share parameters. They provide results of their algorithm on a version of the "Guess Who?" game, in which two agents, namely the "asking" and the "answering" agent, participate. The game revolves around guessing the true image that the answering agent knows: the asking agent has n images and, by asking n/2 questions, should guess the correct image. The answering agent returns only "yes/no" answers, and after n/2 questions the asking agent guesses the target image. The results of their algorithm with different parameters are presented. Following a similar line, Lazaridou et al. (2017) consider a problem with two agents and one round of communication, in order to learn an interpretable language between the sender and receiver agents. The sender receives two images, knowing which is the target, and sends a message to the receiver along with the images. If the receiver guesses the correct image, both win a reward. Thus, they need to learn to communicate through the message. Each agent converts the images to vectors using a VGG ConvNet (Simonyan and Zisserman 2014).
The sender builds a neural network on top of the input vector to select one of the available symbols (vocabulary sizes of 10 and 100 are used in the experiments) as the message. The receiver embeds the message into a vector of the same size as the images' vectors, and then, through a neural network, combines them to obtain its guess. Both agents use the REINFORCE algorithm (Williams 1992) to train their models and do not share their policies with each other. There is no pre-designed meaning associated with the utilized symbols. Their results demonstrate a high success rate and show that the learned communications are interpretable. In another work in this direction, Das et al. (2017) consider a fully cooperative two-agent game for the task of image guessing. In particular, two bots, namely a questioner bot (Q-BOT) and an answerer bot (A-BOT), communicate in natural language, and the task of Q-BOT is to guess an unseen image from a set of images. At every round of the game, Q-BOT asks a question and A-BOT provides an answer. Then Q-BOT updates its information and makes a prediction about the image. The action space is common to both agents, consisting of all possible output sequences under a token vocabulary V, though the state is local for each agent. For A-BOT, the state includes the sequence of questions and answers and the caption provided to Q-BOT, besides the image itself, while the state of Q-BOT does not include the image information. There exists a single reward for both agents in this game. Similar to Lazaridou et al. (2017), the REINFORCE algorithm (Williams 1992) is used to train both agents. Note that Jorge et al. (2016) allow "yes/no" actions within multiple rounds of communication, Lazaridou et al. (2017) consist of one single round with continuous messages, and Das et al.
(2017) combine them such that multiple rounds of continuous communication are allowed. Similarly, Mordatch and Abbeel (2018b) study a joint-reward problem in which each agent observes the locations and communication messages of all agents. Each agent has a given goal vector g, accessed only privately (like moving to or gazing at a given location), and the goal may involve interacting with other agents. Each agent chooses one physical action (e.g., moving or gazing to a new location) and chooses one of the K symbols from a given vocabulary list. The symbols are treated as abstract categorical variables without any predefined meaning, and the agents learn to use each symbol for a given purpose. All agents have the same action space and share their policies. Unlike Lazaridou et al. (2017), there is an arbitrary number of agents, there are no predefined roles such as speaker and listener, and the goals are not defined as specifically as producing the correct utterance. The goal of the model is to maximize the reward while creating a language that is interpretable and understandable by humans. To this end, a soft penalty is added to encourage small vocabulary sizes, which results in having multiple words to create a meaning. The proposed model uses the state variables of all agents and a fully connected neural network to obtain the embedding Φ_s. Similarly, Φ_c is obtained as an embedding of all messages. Then, it combines the goal of agent i, Φ_s, and Φ_c through a fully connected neural network to obtain ψ_u and ψ_c. The physical action is u = ψ_u + ε and the communication message is c ∼ G(ψ_c), in which ε ∼ N(0, 1) and G(c) = −log(−log(c)) is a Gumbel-Softmax estimator (Jang et al. 2016). The results of the algorithm are compared with a no-communication approach in the mentioned game. Sukhbaatar et al.
(2016) consider a fully cooperative multi-agent problem in which each agent observes a local state and is able to send a continuous communication message to the other agents. They propose a model, called CommNet, in which a central controller takes the state observations and the communication messages of all agents and runs multi-step communication to provide the actions of all agents in the output. CommNet assumes that each agent receives the messages of all agents. In the first round, the state observation s_i of agent i is encoded into h_i^0, and the communication messages c_i^0 are zero. Then, in each round 0 < t < K, the controller concatenates each h_i^{t-1} and c_i^{t-1}, passes them into a function f(·), which is a linear layer followed by a non-linearity, and obtains h_i^t and c_i^t for all agents. To obtain the actions, h_i^K is decoded to provide a distribution over the action space. Furthermore, they provide a version of the algorithm in which each agent only observes the messages of its neighbors. Note that, compared to Foerster et al. (2016), CommNet allows multiple rounds of communication between agents, and the number of agents can differ across episodes. The performance of CommNet is compared with independent learners, fully connected communication, and discrete communication on a traffic junction task and the Combat game from Sukhbaatar et al. (2015). To extend CommNet, Hoshen (2017) proposes the Vertex Attention Interaction Network (VAIN), which adds an attention vector to learn the importance weight of each message. Then, instead of concatenating the messages, their weighted sum is obtained and used to take the action. VAIN works well when there are sparse agents that interact with each other. They compare their solution with CommNet over several environments. In Peng et al.
(2017) the authors introduce a bi-directional communication network (BiCNet) using a recurrent neural network, such that heterogeneous agents can communicate with different sets of parameters. A multi-agent vectorized version of the AC algorithm is then proposed for a combat game. In particular, there exist two vectorized networks, namely the actor and critic networks, which are shared among all agents, and each component of the vector represents an agent. The policy network takes the shared observation together with the local information and returns the actions for all agents in the network. The bi-directional recurrent network is designed so that it also serves as a local memory. Therefore, each individual agent is capable of maintaining its own internal states, besides sharing the information with its neighbors. In each iteration of the algorithm, the gradients of both networks are calculated, and the weights of the networks are updated accordingly using the Adam algorithm. In order to reduce the variance, they apply the deterministic off-policy AC algorithm (Silver et al. 2014). The proposed algorithm was applied to the multi-agent StarCraft combat game (Samvelyan et al. 2019). It is shown that BiCNet is able to discover several effective ways to collaborate during the game. Singh et al. (2018) consider the multi-agent problem in which each agent has a local reward and a local observation. An algorithm called the Individualized Controlled Continuous Communication Model (IC3Net) is proposed to learn what and when to communicate, and it can be applied to cooperative, competitive, and semi-cooperative environments⁴. IC3Net allows multiple continuous communication cycles and in each round uses a gating mechanism to decide whether to communicate or not. The local observation o_i^t is encoded and passed to an LSTM model, whose weights are shared among the agents.
Then, the final hidden state h_i^t of the LSTM for agent i at time step t is used to obtain the final policy. A softmax function f(·) over h_i^t returns a binary action that decides whether to communicate or not. Denoting by c_i^t the message of agent i at time t, the action a_i^t and the next message c_i^{t+1} are:

$g^{t+1}_i = f(h^t_i)$,  (28)
$h^{t+1}_i, l^{t+1}_i = \mathrm{LSTM}\big(e(o^t_i) + c^t_i,\; h^t_i,\; l^t_i\big)$,  (29)
$c^{t+1}_i = \frac{1}{N-1}\, C \sum_{j \neq i} h^{t+1}_j g^{t+1}_j$,  (30)
$a^t_i = \pi(h^t_i)$,  (31)

in which l_i^t is the cell state of the LSTM cell, C is a linear transformation, and e(·) is an embedding function. The policy π and the gating function f are trained using the REINFORCE algorithm (Williams 1992).

⁴ Semi-cooperative environments are those in which each agent pursues its own goal while all agents also want to maximize a common goal.

In order to analyze the performance of IC3Net, predator-prey, traffic junction (Sukhbaatar et al. 2016), and StarCraft with exploration and combat tasks (Samvelyan et al. 2019) are considered. The results are compared with CommNet (Sukhbaatar et al. 2016), a no-communication model, and a no-communication model with only a global reward. In Jaques et al. (2019), the authors aim to avoid centralized learning in multi-agent RL problems when each agent observes a local state o_i^t, takes a local action, and receives a local reward z_i^t from the environment. The key idea is to define a reward, called the intrinsic reward, for influencing the other agents' actions. In particular, each agent simulates the potential actions that it can take and measures their effect on the behavior of the other agents. Then, the actions which have a higher effect on the actions of the other agents are rewarded more. Following this idea, the reward function r_i^t = α z_i^t + β c_i^t is used, where c_i^t is the causal influence reward on the other agents, and α and β are trade-off weights.
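The gated message pooling of equations (28)-(30) above can be sketched numerically as follows; the sizes, the random linear map, and the hard 0/1 gates are illustrative assumptions (in IC3Net the gates are sampled from the learned gating policy):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 3, 4
h = rng.normal(size=(N, d))             # hidden states h_j^{t+1}
g = np.array([1.0, 0.0, 1.0])           # binary gates g_j^{t+1}: agent 1 is silent
C = rng.normal(scale=0.1, size=(d, d))  # linear transformation C

def ic3net_message(i, h, g, C):
    """c_i^{t+1} = (1 / (N-1)) * C @ sum_{j != i} h_j * g_j  (cf. eq. 30)."""
    mask = np.ones(len(h), dtype=bool)
    mask[i] = False                      # exclude agent i's own state
    pooled = (h[mask] * g[mask, None]).sum(axis=0) / (len(h) - 1)
    return C @ pooled

c0 = ic3net_message(0, h, g, C)   # only agent 2's state reaches agent 0
```

Because the gate multiplies the hidden state before pooling, a silent agent contributes exactly zero to every other agent's incoming message.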
c_i^t is computed by measuring the KL divergence between the policy of agent j when a_i is known and when it is unknown, as below:

$c^t_i = \sum_{j \neq i} D_{\mathrm{KL}}\big[\, p(a^t_j \mid a^t_i, o^t_i) \,\|\, p(a^t_j \mid o^t_i) \,\big]$  (32)

In order to measure the influence reward, two different scenarios are considered: (i) centralized training, in which each agent observes the probability of another agent's action for a given counterfactual, and (ii) modeling the other agents' behavior. The first case can be handled directly by equation (32). In the second case, each agent learns p(a_j^t | a_i^t, o_i^t) through a separate neural network. In order to train these neural networks, the agents use the history of observed actions and the cross-entropy loss function. The proposed algorithm is analyzed on the harvest and clean-up environments and is compared with an A3C baseline and a baseline which allows the agents to communicate with each other. This work is partly relevant to the Theory of Mind, which tries to explain the effect of agents on each other in multi-agent settings; for more details, see Rabinowitz et al. (2018). Das et al. (2019) propose an algorithm, called TarMAC, to learn to communicate in a multi-agent setting, where the agents learn what to send and also to which agent to communicate. They show that the learned policy is interpretable and can be extended to competitive and mixed environments. To make sure that a message gets enough attention from the intended agents, each agent also encodes some information in the continuous message to define the type of agent that the message is intended for. This way, the receiving agent can measure the relevance of the message to itself. The proposed algorithm follows a centralized-training decentralized-execution paradigm.
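The causal influence reward in equation (32) compares agent j's counterfactual-conditioned policy with its marginal policy. A minimal discrete sketch, with made-up probability tables purely for illustration, is:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical conditional policy of agent j given agent i's action a_i:
# rows indexed by a_i, columns by a_j.
p_j_given_ai = np.array([[0.8, 0.2],
                         [0.3, 0.7]])
p_ai = np.array([0.5, 0.5])            # agent i's own action distribution

# Marginal policy of agent j with agent i's action integrated out.
p_j_marginal = p_ai @ p_j_given_ai     # [0.55, 0.45]

# Influence of taking a_i = 0: how much it shifts agent j's policy.
influence = kl(p_j_given_ai[0], p_j_marginal)
# influence > 0; it would be exactly 0 if a_i had no effect on agent j.
```

Summing such terms over all other agents j yields the intrinsic reward c_i^t that is then mixed with the environment reward.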
Each agent accesses a local observation and observes the messages of all agents, and the goal is to maximize the team reward R, while the discrete actions are executed jointly in the environment. Each agent sends a message consisting of two parts, the signature (k_i^t ∈ R^{d_k}) and the value (v_i^t ∈ R^{d_v}). The signature provides information about the agent intended to receive the message, and the value is the message itself. Each recipient j receives all messages and learns a variable q_j^t ∈ R^{d_k} with which to receive them. Multiplying q_j^t with k_i^t for all i ∈ {1, ..., N} results in the attention weights α_ij for all messages from agents i ∈ {1, ..., N}. Finally, the aggregated message c_j^t is the weighted sum of the message values, with the obtained attention values as weights. This aggregated message and the local observation o_j^t are the input of the local actor. Then, a regular actor-critic model with a centralized critic is trained. The actor is a single GRU layer, and the critic uses the joint actions {a_1, ..., a_N} and the hidden states {h_1, ..., h_N} to obtain the Q-value. Also, the actors share the policy parameters to speed up the training, and multi-round communication is used to increase efficiency. The proposed method (along with no-attention and no-communication versions of the algorithm) is evaluated on SHAPES (Andreas et al. 2016), a traffic junction task in which they control the cars, and House3D (Wu et al. 2018), and is compared with CommNet (Sukhbaatar et al. 2016) where possible. In the same direction as DIAL, Freed et al. (2020) proposed a centralized-training decentralized-execution algorithm based on stochastic message encoding/decoding to provide a discrete communication channel that is mathematically equivalent to a communication channel with additive noise.
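The signature/query attention used by TarMAC can be sketched as below; the dimensions, the random vectors, and the scaled-dot-product form are assumptions for illustration, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_k, d_v = 4, 8, 6
K = rng.normal(size=(N, d_k))   # signatures k_i^t, one per sender
V = rng.normal(size=(N, d_v))   # message values v_i^t
Q = rng.normal(size=(N, d_k))   # learned queries q_j^t, one per receiver

def tarmac_aggregate(Q, K, V):
    """alpha[j, i] = softmax_i(q_j . k_i); c_j = sum_i alpha[j, i] * v_i."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha, alpha @ V

alpha, c_msgs = tarmac_aggregate(Q, K, V)
# alpha[j] sums to 1 over senders; c_msgs[j] is receiver j's aggregated message.
```

Because the attention weights depend on both the learned signatures and the learned queries, senders can effectively address messages and receivers can down-weight irrelevant ones.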
The proposed algorithm allows gradients to backpropagate through the channel from the receiver of the message to the sender. The base framework of the algorithm is somewhat similar to DIAL (Foerster et al. 2016); however, unlike DIAL, the proposed algorithm is designed to work under known and unknown additive communication noise. In the algorithm, the sender agent generates a real-valued message z and passes it to a randomized encoder, which adds a uniform noise ε ∼ U(−1/M, 1/M) to the continuous message to get z̃. Then, z̃ is discretized into one of the M = 2^C possible discrete messages by mapping it into 2^C possible ranges. The discrete message m is sent to the receiver, where a randomized decoder tries to reconstruct the original continuous message z from m. The decoder uses the mapping of the 2^C possible ranges to extract a message ẑ and then subtracts a uniform noise to get an approximation of the original message. The uniform noise in the decoder is generated from the same distribution that the sender used to add noise to the message. It is proved that, with this encoder/decoder, ẑ = z + ε′, which is mathematically equivalent to a system in which the sender sends the real-valued message to the receiver through a channel that adds uniform noise from a known distribution to the message. In addition, they provide another version of the encoder/decoder functions to handle the case in which the noise distribution is unknown and is a function of the message and the state variable, i.e., m̂ ∼ P(· | m, S). In the numerical experiments, an actor-critic algorithm is used to train the weights of the networks, in which the critic observes the full state of the system and the actors share their weights.
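The encoder/decoder pair with synchronized uniform noise is essentially subtractive dithered quantization; the sketch below assumes a message range of [0, 1) and uses a shared RNG seed to stand in for the synchronized noise draws (the paper's construction details may differ):

```python
import numpy as np

C_BITS = 4
M = 2 ** C_BITS                     # number of discrete channel symbols

def encode(z, rng):
    """Add uniform noise, then quantize to one of the discrete levels."""
    eps = rng.uniform(-1.0 / M, 1.0 / M)
    z_tilde = z + eps
    return int(np.clip(np.round(z_tilde * M), 0, M))   # discrete message m

def decode(m, rng):
    """Map the symbol back, then subtract the same (synchronized) noise."""
    eps = rng.uniform(-1.0 / M, 1.0 / M)               # identical draw: shared seed
    return m / M - eps

z = 0.3712
m = encode(z, np.random.default_rng(42))
z_hat = decode(m, np.random.default_rng(42))           # same seed = same noise
# The reconstruction error is pure quantization error, bounded by 1 / (2 * M).
```

Since the decoder removes exactly the noise the encoder added, what remains is the rounding error of the grid, which is why the whole channel behaves like the identity plus small additive noise and stays amenable to gradient-based training.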
The performance of the algorithm is analyzed in two environments: (i) hidden-goal path-finding, in which, in a 2-D grid, each agent is assigned a given goal cell and needs to arrive at that goal, with 5 actions: move in one of four directions or stay. Each agent observes its own location and the goals of the other agents, so the agents need to find out the locations of the other agents and the location of their own goal through communication with the other agents; (ii) coordinated multi-agent search, where there are two agents in a 2-D grid problem, and they are able to see a goal only when they are adjacent to it or on the goal cell, so the agents need to communicate with the others to get information about their goals. The results of the proposed algorithm are compared with (i) a reinforced communication learning (RCL) based algorithm (like RIAL in Foerster et al. (2016), in which the communication action is treated like another action of the agent and is trained by RL algorithms) with noise, (ii) RCL without noise for all cases, (iii) passing the real-valued message to the agents, and (iv) no communication, for one of the environments. All the papers discussed so far in this section assume the existence of a communication message, and basically they allow each agent to learn what to send. In a different approach, Jiang and Lu (2018) fix the message type and only allow each agent to decide to start a communication with the agents in its receptive field. They consider the problem in which each agent observes a local observation, takes a local action, and receives a local reward. The key idea here is that when there is a large number of agents, sharing the information of all agents might not be helpful, since it is hard for an agent to differentiate the valuable information from the shared information. In this case, communication might even impair learning.
To address this issue, an algorithm called ATOC is proposed, in which an attention unit learns when to integrate the shared information from the other agents. In ATOC, each agent encodes its local observation, i.e., h_i^t = μ^I(o_i^t; θ^μ), in which θ^μ is the weights of an MLP. Every T time-steps, agent i runs an attention unit with input h_i^t to determine whether to communicate with the agents in its receptive field or not. If it decides to communicate, a communication group with at most m collaborators is created and does not change for T time-steps. Each agent in this group sends its encoded information h_i^t to the communication channel, in which the encodings are combined and h̃_i^t is returned for each agent i ∈ M_i, where M_i is the list of the agents selected for the communication channel. Then, agent i merges h̃_i^t with h_i^t, passes the result to the MLP, and obtains a_i^t = μ^{II}(h_i^t, h̃_i^t; θ^μ). Note that one agent can be added to two communication channels, and as a result, information can be transferred among a larger number of agents. The actor and critic models are trained in the same way as in DDPG, and the gradients of the actor (μ^{II}) are also passed through the communication channel, if relevant. Also, the difference of the Q-values with and without communication is obtained and used to train the attention unit. Numerical experiments on the particle environment are presented, and ATOC is compared with CommNet, BiCNet, and DDPG (ATOC without any communication). Their experiments involve at least 50 agents, so the MADDPG algorithm could not be used as a benchmark.

9 Other approaches and hybrid algorithms

In this section, we discuss a few recent papers which either combine the approaches of Sections 5-8 or propose a model that does not quite fit in any of the previous sections. Schroeder de Witt et al.
(2019) consider a problem in which each agent observes a local observation, selects an action which is not known to the other agents, and receives a joint reward, known to all agents. Further, it is assumed that all agents access a common knowledge: every agent knows this information, each agent knows that all agents know it, and so on. Also, there might be subgroups of agents who share more common knowledge, and the agents inside each group use a centralized policy to select actions, while each agent plays its own action in a decentralized manner. Typically, smaller subgroups of agents share more common knowledge, and their selected actions result in higher performance than the actions selected by larger groups. So, forming groups of smaller size is appealing. However, there is a computational trade-off between selecting smaller or larger subgroups, since there are numerous possible combinations of agents to form smaller groups. This paper proposes an algorithm to address this challenge, i.e., to either divide the agents into new subgroups or take actions via a larger joint policy. The proposed algorithm, called MACKRL, provides a hierarchical RL scheme in which each level of the hierarchy decides either to choose a joint action for the subgroup or to propose a partition of the agents into smaller subgroups. This algorithm is very expensive to run, since the number of possible joint actions increases exponentially and the algorithm becomes intractable. To address this issue, a pairwise version of the algorithm is proposed, in which there are three levels of hierarchy: the first for grouping agents, the second for either action selection or sub-grouping, and the last for action selection. Also, a Central-V algorithm is presented for training the actor and critic networks. In Shu and Tian (2019) a different setting of the multi-agent system is considered.
In this problem, a manager along with a set of self-interested agents (workers) with different skills and preferences work on a set of tasks. In this setting, the agents like to work on their preferred tasks (which may not be profitable for the entire project) unless they are offered the right bonus for doing a different task. Furthermore, the manager does not know the skills and preferences (or any distribution of them) of each individual agent in advance. The goal of the problem is to train the manager to control the workers by inferring their minds and assigning incentives to them upon the completion of particular goals. The approach includes three main modules. (i) Identification, which uses the workers' performance histories to recognize the identity of the agents. In particular, the performance history of agent i is denoted by P_i = {P_i^t = (ρ_{igb}^t) : t = 1, 2, ..., T}, where ρ_{igb}^t is the probability that worker i finishes goal g in t steps given b bonuses. In this module, these matrices are flattened into a vector and encoded into the history representation denoted by h_i. (ii) Modeling the behavior of agents. A worker's mind is modeled by its performance, intentions, and skills. In the mind tracker module, the manager encodes both current and past information to update its beliefs about the workers. Formally, let Γ_i^t = {(s_i^τ, a_i^τ, g_i^τ, b_i^τ) : τ = 1, 2, ..., t} denote the trajectory of worker i. Then the mind tracker module M receives Γ_i^t as well as the history representation h_i from the first module and outputs m_i as the mind of agent i. (iii) Training the manager, which includes assigning the goals and bonuses to the workers. To this end, the manager needs to have all workers as a context, defined as c^{t+1} = C({(s_i^{t+1}, m_i^t, h_i) : i = 1, 2, ..., N}), where C pools all workers' information.
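As a rough sketch of the context pooling step, the context c^{t+1} can be formed by combining each worker's (state, mind, history) features; the concatenate-and-average pooling below is our assumption, since the paper leaves C as a learned network:

```python
def pool_context(states, minds, histories):
    """Sketch of the manager's context module C({(s_i, m_i, h_i)}): concatenate
    each worker's state, mind vector, and history representation, then
    average-pool element-wise over the N workers (pooling choice assumed)."""
    per_worker = [list(s) + list(m) + list(h)
                  for s, m, h in zip(states, minds, histories)]
    n = len(per_worker)
    return [sum(col) / n for col in zip(*per_worker)]
```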
Then, utilizing both the individual information and the context, the manager module provides the goal policy π_g and the bonus policy π_b for all workers. All three modules are trained using the A2C algorithm. The proposed algorithm is evaluated in two environments: Resource Collection and Crafting in 2D Minecraft. The results demonstrate that the manager can estimate the workers' minds by monitoring their behavior and motivate them to accomplish the tasks they do not prefer.

Next, we discuss MARL in a hierarchical setting. To do so, let us briefly introduce hierarchical RL. In this setting, the problem is decomposed into a hierarchy of tasks such that easy-to-learn tasks are at the lower level of the hierarchy and a strategy to select among those tasks is learned at a higher level of the hierarchy. Thus, in the hierarchical setting, the decisions at the high level are made less frequently than those at the lower level, which usually happen at every step. The high-level policy is mainly focused on long-run planning, which involves several one-step tasks at the low level of the hierarchy. Following this approach, in single-agent hierarchical RL (e.g., Kulkarni et al. (2016), Vezhnevets et al. (2017)), a meta-controller at the high level learns a policy to select the sequence of tasks, and a separate policy is trained to perform each task at the low level. For hierarchical multi-agent systems, two possible scenarios are synchronous and asynchronous. In synchronous hierarchical multi-agent systems, all high-level agents take actions at the same time. In other words, if one agent finishes its low-level actions earlier than the other agents, it has to wait until all agents finish their low-level actions. This could be a restrictive assumption if the number of agents is quite large. On the other hand, there is no such restriction in asynchronous hierarchical multi-agent systems.
Nonetheless, obtaining high-level cooperation in asynchronous cases is challenging. In the following, we study some recent papers in hierarchical MARL. In Tang et al. (2018) a cooperative problem with sparse and delayed rewards is considered, in which each agent accesses a local observation and takes a local action, and the joint action is submitted to the environment to obtain the local rewards. Each agent has some low-level and high-level actions to take, such that the task-selection problem of each agent can be modeled as a hierarchical RL problem. To solve this problem, three algorithms are proposed: Independent hDQN (Ind-hDQN), hierarchical Communication networks (hCom), and hierarchical hQmix. Ind-hDQN is based on the hierarchical DQN (hDQN) (Kulkarni et al. 2016) and decomposes the cooperative problem into independent goals and then learns them in a hierarchical manner. In order to analyze Ind-hDQN, we first describe hDQN, for the single-agent case, and then explain Ind-hDQN for the multi-agent setting. In hDQN, the meta-controller is modeled as a semi-MDP (SMDP) and the aim is to maximize r̃_t = R(s_{t+τ} | s_t, g_t) = r_t + ... + r_{t+τ}, where g_t is the goal selected by the meta-controller and τ is the stochastic number of periods needed to achieve the goal. Via r̃_t, a DQN algorithm learns the meta-controller policy. This policy decides which low-level task should be taken at each time step. Then, the low-level policy learns to maximize the goal-dependent reward r̂_t. In Ind-hDQN it is assumed that agent i knows its local observation o_i^t, its meta-controller learns the policy π_i(g_i^t | o_i^t), and at the low level it learns the policy π̂_i(a_i^t | g_i^t) to interact with the environment. The low-level policy is trained by the intrinsic reward r̂_i^t, and the meta-controller's policy is trained by the environment's reward signals r_i^t.
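One meta-controller step of the SMDP described above, with its accumulated reward r̃_t = r_t + ... + r_{t+τ}, can be sketched as follows; env_step, goal_reached, and the step cap are hypothetical interfaces, not from the papers:

```python
def run_goal(state, goal, env_step, low_level_policy, goal_reached, max_steps=50):
    """Execute the low-level policy for goal g until the goal is reached (or a
    step cap is hit), accumulating the environment rewards into the SMDP
    reward r_tilde = r_t + ... + r_{t+tau} used to train the meta-controller."""
    r_tilde, tau = 0.0, 0
    while tau < max_steps:
        action = low_level_policy(state, goal)
        state, r = env_step(state, action)
        r_tilde += r            # undiscounted sum, as in the text
        tau += 1
        if goal_reached(state, goal):
            break
    return state, r_tilde, tau
```

For instance, on a number line where each step costs −1, reaching goal 3 from state 0 returns r̃ = −3 after τ = 3 steps.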
Since Ind-hDQN trains independent agents, it can be applied to both synchronous and asynchronous settings. In the second algorithm, named hCom, the idea of CommNet (Sukhbaatar et al. 2016) is combined with Ind-hDQN. In this way, Ind-hDQN's neural network is modified to include the average of the h-th hidden layers of the other agents, i.e., it is added as an input to the (h+1)-th layer of each agent. Similar to Ind-hDQN, hCom works for both synchronous and asynchronous settings. The third algorithm, hQmix, is based on Qmix (Rashid et al. 2018) to handle the case in which all agents share a joint reward r_t. To this end, the Qmix architecture is added to the meta-controller, and as a result Qmix allows training separate Q-values for each agent. This is possible by learning Q_tot as is done in Qmix. hQmix is only applicable to synchronous settings, since Q_tot is estimated over the joint action of all agents. In each of the proposed algorithms, the weights of the policy network are shared among the tasks that have the same input and output dimensions. Moreover, the weights of the neural network are shared among the agents for the low-level policies. Thus, only one low-level network is trained, although it can be used for different tasks and by all agents. In addition, a new experience replay, Augmented Concurrent Experience Replay (ACER), is proposed. ACER saves the transition tuple (o_i^t, g_i^t, r̃_i^t, τ, o_i^{t+τ}) for the meta-controller and saves AE_i(t, τ) = {(o_i^{t+k}, g_i^t, r̃_i^{t+k}, τ − k, o_i^{t+τ})}_{k=0}^{τ−1} to train the low-level policy. ACER also uses the idea of Concurrent Experience Replay Trajectories (CERTs) (Omidshafiei et al. 2017), such that experiences are stored in rows of episodes and columns of time steps to ensure the availability of concurrent mini-batches.
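The CommNet-style averaging used in hCom above can be sketched as follows; combining the own hidden state and the communication vector by concatenation is our simplifying assumption:

```python
def hcom_layer(hidden):
    """hidden: list of the N agents' h-th hidden layer vectors. Returns the
    input to the (h+1)-th layer of each agent: its own hidden state combined
    with the element-wise mean of the other agents' hidden states."""
    out = []
    for i, own in enumerate(hidden):
        others = [h for j, h in enumerate(hidden) if j != i]
        if others:
            comm = [sum(vals) / len(others) for vals in zip(*others)]
        else:
            comm = [0.0] * len(own)
        out.append(list(own) + comm)
    return out
```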
They have analyzed their algorithm on Multi-agent Trash Collection tasks (an extension of the environment of Makar et al. (2001)) and Fever Basketball Defense. In the experiments, the low-level learning is done for several homogeneous agents, so that it can be considered a single-agent learning problem. The results are compared with Ind-DQN and Ind-DDQN with prioritized experience replay (Schaul et al. 2016).

In the MARL literature, there are several papers which consider games with a Nash equilibrium. In a multi-agent system, a Nash equilibrium is achieved if every agent attains its own highest possible value function and is not willing to deviate from its policy. Here we only discuss a few recent papers, since Lanctot et al. (2017) provide a detailed review of the older papers and Yang and Wang (2020) present a full review, from the game-theoretic perspective, of the multi-agent formulation and MARL algorithms. Zhang et al. (2018a) discuss the coordination problem in multi-agent cooperative domains with a fully observable state and continuous actions, with the aim of finding the Pareto-optimal Nash equilibrium. The paper proposes an algorithm named Sample Continuous Coordination with recursive Frequency Maximum Q-Value (SCC-rFMQ), which includes two main parts: (i) given state s, a set of discrete actions from the continuous set A_i(s) is selected for agent i; (ii) action evaluation and policy training are performed. The first phase involves selecting a set of good actions which have been seen before, while performing exploration. To this end, it follows a Coordination Re-sample (CR) strategy that preserves the n/3 best previous actions for each agent. The rest of the actions are selected according to a variable probability distribution, i.e., actions are drawn randomly from N(a_max(s), σ), in which a_max(s) is the action that gives the maximum Q-value for state s.
Let a*(s) denote the action with the best seen Q-value. If a_max(s) ≠ a*(s), the exploration rate σ_i(s) is reset to the initial value of 1/3; otherwise, if V(s) < Q_i(s, a_max), CR shrinks σ_i(s) with a given rate, and expands it otherwise. Using the new σ_i(s), new actions are selected via N(a_max, σ), and A_i(s) and Q(s, a) are reset. Besides CR, Zhang et al. (2018a) also utilize rFMQ (Matignon et al. 2012), which extends Q-learning with the frequency value F(s, a). In this way, in addition to Q(s, a), Q_max(s, a) is also obtained and updated through the learning procedure. To select actions, an evaluation-value function E(s, a) is obtained and followed greedily, such that E(s, a) = (1 − F(s, a)) Q(s, a) + F(s, a) Q_max(s, a), where F(s, a) estimates the percentage of time that action a results in observing the maximum reward for state s. The estimate of the frequency F(s, a) is updated recursively via a separate learning rate. To show the effectiveness of SCC-rFMQ, results are presented on a climbing game, a two-player game with Nash equilibrium, as well as on the boat navigation problem with two actions. SCC-rFMQ is compared with MADDPG, Sequential Monte Carlo methods (SMC) (Lazaric et al. 2008), and rFMQ (Matignon et al. 2012).

Category | Problem | Reference | Goal | Algorithm
Web service | Task scheduling for web services | Wang et al. (2016a) | Minimize time and cost | Tabular Q-learning
Traffic control | Control multiple traffic signals | Prabuchandran et al. (2014) | Minimize queue on a neighborhood | Tabular Q-learning
Traffic control | Control multi-intersection traffic signals | Wei et al. (2019a) | Minimize total travel time | IQL
Traffic control | Control multi-intersection traffic signals | Gong et al. (2019) | Minimize cumulative delay | IQL with DDQN
Traffic control | Control multi-intersection traffic signals | Chu et al. (2019) | Minimize queue | IAC
Traffic control | Control multi-intersection traffic signals | Wei et al. (2019b) | Minimize queue | AC with attention
Traffic control | Control single and multi-intersection traffic signals | Zheng et al. (2019) | Minimize queue | IQL with ApeX-DQN
Traffic control | Ride-sharing management | Lin et al. (2018) | Improve resource utilization | Q-learning and AC
Traffic control | Air-traffic control | Brittain and Wei (2019) | Conflict resolution | CTDE A2C
Traffic control | Bike re-balancing problem | Xiao (2018) | Improve trip frequency & bike usage | Tabular Q-learning
Resource allocation | Online resource allocation | Wu et al. (2011), Wu and Xu (2018) | Maximize the utility of servers | Tabular Q-learning
Resource allocation | Packet routing in wireless sensor networks | Ye et al. (2015) | Minimize consumed energy | Q-learning
Robot path planning | Multi-agent path finding with static obstacles | Sartoretti et al. (2019a) | Find the shortest path | IAC with A3C
Robot path planning | Multi-agent path finding with dynamic obstacles | Wang et al. (2020a) | Find the shortest path | Double DQN
Production systems | Production control, job-shop scheduling | Dittrich and Fohlmeister (2020) | Minimize average cycle time | DQN
Production systems | Transportation in semiconductor fabrication | Ahn and Park (2021) | Minimize retrieval time | AC
Image classification | Image classification with a swarm of robots | Mousavi et al. (2019) | Minimize classification error | REINFORCE
Stock market | Liquidation of a large amount of stock | Bao and Liu (2019) | Maximize the liquidation sell value | DDPG
Stock market | Buy or sell stocks | Lee et al. (2007) | Maximize the profit | Q-learning
Maintenance planning | Maintenance management | Andriotis and Papakonstantinou (2019) | Minimize life-cycle cost | AC based

Table 2: A summary of applications of multi-agent problems with MARL algorithms which are reviewed in this paper.
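The rFMQ action evaluation described above reduces to the following sketch (Q, Q_max, and F as lookup tables; their update rules are omitted):

```python
def rfmq_select(Q, Q_max, F, s, actions):
    """Greedy action selection on E(s,a) = (1 - F(s,a)) Q(s,a) + F(s,a) Q_max(s,a),
    where F(s,a) estimates how often action a produced the maximum reward."""
    E = {a: (1 - F[s, a]) * Q[s, a] + F[s, a] * Q_max[s, a] for a in actions}
    return max(E, key=E.get), E
```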
SMC (Lazaric et al. 2008) itself is an actor-critic algorithm with a continuous action space. In this algorithm, the actor takes actions randomly, such that the probability of extracting each action is equal to the importance weight of the action. Also, the critic approximates an action-value function based on the observed reward of the played action. Then, based on that function, the actor updates the policy distribution. In this way, for each state the actor provides a probability distribution over the continuous action space, and the action is selected based on importance sampling.

10 Applications

Multi-agent problems and MARL algorithms have numerous applications in the real world. In this section, we review some of the application-oriented papers, in which the problem is modeled as a multi-agent problem and a MARL algorithm is utilized to solve it. The main focus is on describing each problem and the kind of approach used to solve it, so we do not go into much technical detail about the problems and algorithms. Table 2 provides a summary of some of the iconic papers reviewed in this section. As shown there, the IQL approach is the most utilized approach in the application papers. For more details about each paper, see the corresponding section.

10.1 Web Service Composition

A multi-agent Q-learning algorithm has been proposed in Wang et al. (2016a) for the dynamic web service composition problem. In this problem, there exists a sequence of tasks that need to be done in order to accomplish the web service composition. The key idea in this work is to decompose the main task into independent sub-tasks. Then a tabular Q-learning algorithm is applied to find the optimal strategy for each sub-task. In addition, to improve the efficiency of the proposed method, they propose an experience sharing strategy.
In this case, there exists a supervisor agent, who is responsible for communicating with all other agents to spread the knowledge existing in a particular agent among all the other ones.

10.2 Traffic Control

Prabuchandran et al. (2014) propose a tabular Q-learning algorithm to control multiple traffic signals on neighboring junctions to maximize the traffic flow. Each agent (a traffic light) accesses local observations, including the number of lanes and the queue length at each lane, decides about the green-light duration, and shares its queue length with its neighbors. The cost of each agent is the average of the queue lengths at its neighbors, so that it tries to minimize its own queue length as well as those of all its neighbors. The algorithm is compared with two classical approaches in a simulated environment of two areas with 9 and 12 junctions in India. A similar problem with an RL approach is studied in Abdoos et al. (2011). Also, recently Zhang et al. (2019a) provided CityFlow, a new traffic-signal environment for MARL research. CityFlow has been used in many traffic signal control problems. In an intersection, there are some predefined sets of phases, determined based on the structure of the intersection, and the goal of the problem can be translated into deciding the sequence of these phases to minimize the total travel time of all vehicles. However, total travel time is not a direct function of the states and actions of an intersection, so auxiliary objective functions like minimizing the queue length, waiting time, or delay time are usually considered instead. A variety of traffic statistics, such as the number of moving/waiting cars in each lane, the queue length, the waiting time, etc., can be used as the state s_t. The action set is usually defined as the set of all possible phases.
Typically, the reward is defined as a combination of several components such as the queue length, the waiting time of the cars, the intersection pressure, etc. See Wei et al. (2019c) for a detailed review. Wei et al. (2019a) consider the multi-intersection traffic signal control problem and propose an IQL-type algorithm to solve it. Each intersection is considered an RL agent, which observes the current phase, the number of cars on the outgoing road, and the number of cars in each segment of the incoming road. The action is to decide the next active phase, and the reward of each intersection is the negative of the corresponding pressure. There is no parameter sharing among the agents, and each agent trains its own weights. Numerical experiments on several synthetic and real-world traffic cases are conducted to show the performance of the algorithm. In a similar paper, Gong et al. (2019) consider the same problem and propose an IQL-based algorithm. To obtain the state, first, each intersection is divided into several chunks to build a matrix in which each chunk includes a binary variable indicating the existence of a vehicle. Then, to get the state of each intersection, the matrices of the considered intersection and its upstream and downstream intersections are obtained and concatenated together, to mitigate the complexity of the multi-agent problem. The reward is defined as the difference between the waiting times of all vehicles between two consecutive cycles, and the action is the next phase to run. The goal of the model is to minimize the cumulative delay of all vehicles. To solve the problem, an IQL approach is proposed in which the agents are trained with the double dueling deep Q-network (Wang et al. 2016c) algorithm, where a CNN along with an FC layer is used to obtain the advantage values.
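The occupancy-based state construction of Gong et al. (2019) can be sketched as follows; the per-chunk binary encoding follows the text, while the channel-stacking layout is our assumption:

```python
def occupancy_matrix(vehicle_cells, rows, cols):
    """Binary occupancy matrix of one intersection: entry 1 if the chunk
    contains a vehicle, 0 otherwise."""
    m = [[0] * cols for _ in range(rows)]
    for r, c in vehicle_cells:
        m[r][c] = 1
    return m

def intersection_state(own, upstream, downstream):
    """State of an intersection: its own occupancy matrix concatenated with
    those of its upstream and downstream intersections (stacked here as a
    list of channels; the exact concatenation layout is an assumption)."""
    return [own, upstream, downstream]
```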
To explore the performance of the algorithm, a commercial traffic simulator, Aimsun Next, is used as the environment, and real-world data from Florida is used, in which eight traffic signals are controlled by the proposed algorithm. Cooperation among the agents plays a pivotal role in the traffic signal control problem, since the action of every individual agent directly influences the other agents. There have been some efforts to address this issue. For example, Prashanth and Bhatnagar (2010) consider a central controller that watches and controls all other agents. This strategy suffers from the curse of dimensionality. Another approach is to assume that agents can share their states with their neighbors (Arel et al. 2010). For example, Chu et al. (2019) show that sharing local information can be very helpful while keeping the algorithm practical. They propose MA2C, a fully cooperative algorithm in which each intersection trains an independent advantage actor-critic and is allowed to share its observations and the probability simplex of its policy with the neighbor agents. Thus, each agent has some information about the regional traffic and can try to alleviate it, rather than following a self-oriented policy that reduces the traffic only at a single intersection. To balance the importance of the local and shared information, a spatial discount factor is considered to scale the effect of the shared observations and rewards. Each agent represents its state by the cumulative delay of the first vehicle at the intersection from time t − 1 to time t, as well as the number of approaching cars within a given distance of the intersection. Each agent chooses its next phase and is rewarded locally by the weighted sum of the queue length along each incoming lane and the waiting time of the cars in each lane of the intersection.
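The spatial discounting in MA2C can be sketched as below; this minimal version only scales distance-1 neighbors, and the discount value is an illustrative assumption:

```python
def spatially_discounted_reward(local_reward, neighbor_rewards, alpha=0.75):
    """MA2C-style spatial discounting sketch: neighbor rewards are scaled by a
    spatial discount factor alpha before being added to the agent's own
    reward, down-weighting information from farther away."""
    return local_reward + alpha * sum(neighbor_rewards)
```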
MA2C is evaluated on a large traffic grid with both synthetic and real-world traffic data, and the results are compared with IA2C and IQL.

Even though some algorithms consider sharing some information among the agents, it is still not known how important that information is for each agent. To address this issue, Wei et al. (2019b) proposed CoLight, an attention-based RL algorithm. Each intersection learns weights for every other agent, and the weighted sum of the neighbors' states is used by each agent. In CoLight, the state includes the current one-hot-encoded phase as well as the number of cars in each lane of the roads, the action is choosing the next phase, and the goal is to minimize the average queue length at each intersection. All intersections share the parameters, so that only one network needs to be trained. Synthetic and real-world data-sets are used to show the performance of the algorithm, including the traffic networks of Jinan with 12 intersections, Hangzhou with 16 intersections, and Manhattan with 196 intersections. None of the mentioned algorithms can address a city-level problem, i.e., thousands of intersections. The main issues are that (i) local reward maximization does not guarantee maximizing the global reward, and gathering the required data is quite challenging; and (ii) the action of each intersection affects the others, so coordination is required to minimize the total travel time. To address these issues, Chacha Chen et al. (2020) proposed MPLight, an RL algorithm for large-scale traffic signal control systems. The state for each agent consists of the current phase and the 12 possible pressure values of the 12 traffic movements. Intersections with a smaller number of movements are zero-padded. The action is to choose one of eight possible phases, and the local reward is the pressure of the intersection.
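The pressure quantity that MPLight uses in its state and reward can be sketched as follows; the lane aggregation follows the common max-pressure definition, which we assume here:

```python
def movement_pressure(in_queue, out_queue):
    """Pressure of a single traffic movement: incoming minus outgoing queue length."""
    return in_queue - out_queue

def intersection_pressure(movements):
    """Intersection pressure: absolute value of the summed movement pressures.
    Intersections with fewer than 12 movements are zero-padded; the local
    reward penalizes this quantity."""
    padded = list(movements) + [(0, 0)] * (12 - len(movements))
    return abs(sum(movement_pressure(i, o) for i, o in padded))
```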
A DQN algorithm is proposed with parameter sharing among the intersections. Both synthetic and real-world data-sets are used, of which Manhattan with 2510 signals is the largest analyzed network. In Zheng et al. (2019) an RL-based algorithm, called FRAP, was proposed for the traffic signal control problem. The key property of FRAP is invariance to symmetric operations such as rotation and flip. Toward this end, two principles of competition are utilized: (i) a larger traffic movement indicates higher demand, and (ii) the lane with higher traffic (demand) is prioritized over the lane with lower traffic. In FRAP, the state is defined as the phase and the number of vehicles in each lane, the action is choosing the next phase, and the reward is the queue length. The proposed model includes three parts: (i) phase demand modeling, which provides the embedded sum of all green-traffic movements in each phase; (ii) phase pair embedding, in which the movement-conflict and phase-demand matrices are built and embedded to the required sizes; and (iii) phase pair competition, which runs two convolutional neural networks and then some fully connected layers to obtain the final Q-values and choose the action. In a related area, Lin et al. (2018) consider the ride-sharing management problem and propose a customized Q-learning and an actor-critic algorithm for this problem. The algorithms are evaluated on a simulation built with Didi Chuxing, which is the largest ride-sharing company in China. The goal is to match the demand and supply to improve the utilization of transportation resources. In a similar direction, Brittain and Wei (2019) study the air-traffic control problem with a MARL algorithm to ensure safe separation between aircraft. Each airplane, as a learning agent, shares its state information with the N closest agents.
Each state includes the distance to the goal, speed, acceleration, distance to the intersection, and the distances to the N closest airplanes. Each agent decides about the change of speed from three possible choices and receives a penalty if it comes within close distance of another airplane. The goal of the model is to identify and resolve the conflicts between aircraft in high-density intersections. The A2C algorithm is used, with a centralized-training decentralized-execution approach. The actor and critic networks share their weights. The BlueSky air traffic control simulator is used as the environment, and the results of a case with 30 airplanes are provided. In Xiao (2018), a distributed tabular Q-learning algorithm was proposed for the bike rebalancing problem. Using distributed RL, they improve the current solutions in terms of the frequency of trips and the number of bikes being moved on each trip. In the problem setup, the action is the number of bikes to move. The agents receive a positive reward if the bike stock is within a particular range, and a negative reward otherwise. Also, there exists a negative reward for every bike moved in each hour. Each agent acts independently; however, there exists a controller, called a knowledge repository (KR), that shares the learning information among the agents. More specifically, the KR is designed to facilitate transfer learning (Lazaric 2012) among the agents. Using only distributed RL, the success ratio (the number of successfully rebalanced stations over the total number of stations) is improved by about 10% to 35%. Furthermore, combined with transfer learning, the algorithm rebalances the network 62.4% better than the version without transfer learning.

10.3 Resource Allocation

Zhang et al. (2009) consider an online distributed resource allocation problem, in which each agent is a server and observes the information of its neighbors.
In each time step, each agent receives a task and has to decide whether to process it locally or pick a neighbor and send the task to that neighbor. Once the task is finished, the agent is rewarded with some utility of the task. Due to communication bandwidth constraints, the number of tasks that each agent can send to its neighbors is limited, and the goal is to cooperatively maximize the utility of the whole cluster. An algorithm based on tabular Q-learning is proposed and its results are compared with those of a centralized controller. Wu et al. (2011) also consider a similar problem and propose another value-based algorithm. Moreover, in Wu and Xu (2018) a similar problem is studied in which there are n schedulers who dispatch the jobs of k customers to m machines. They also propose a value-based algorithm and use a gossip mechanism to transfer utilities among the neighbor agents. In a similar domain, Ye et al. (2015) consider a packet routing problem in wireless sensor networks and model it as a multi-agent problem. In this problem, the sensor data should be sent to a base station for analysis, though usually each sensory unit has a limited capacity to store data, and there are strict communication bandwidth limits. This problem has several applications in surveillance, video recording, processing, and communication. When a packet is sent from a given sensory unit, the distance to the destination and the size of the packet determine the required energy to send the packet, and one of the goals of the system is to minimize the total consumed energy. They propose a MARL algorithm based on Q-learning in which each agent selects some cooperating neighbors within a given radius and then can communicate with them. The results are compared with several classical algorithms. Within a similar domain, Dandanov et al.
(2017) propose an RL algorithm to adjust the antenna tilt angle in mobile antennas, to strike a compromise between mobile coverage and network capacity. The reward matrix of the problem is built and the transition probabilities among the states are known, so that the optimal value for each state-action pair is obtained.

10.4 Robot Path Planning

The multi-agent path finding (MAPF) problem is an NP-hard problem (Ma et al. 2019, LaValle 2006). The goal is to find a path for each of several agents in a system with obstacles, going from a given source to a given destination for each agent. The algorithms to solve MAPF can, in general, be categorized into three classes: coupled, decoupled, and dynamically-coupled methods. The coupled approaches treat MAPF as a single high-dimensional agent, which results in exponential growth of complexity. On the other hand, decoupled approaches obtain a separate path for each agent and then adjust the paths with the goal of zero collisions. Decoupled approaches are able to quickly obtain solutions for large problems (Leroy et al. 1999, Van Den Berg et al. 2011). One of the common approaches to path adjustment is "velocity planning" (Cui et al. 2012, Chen et al. 2017), which modifies the velocity profile of each agent along its path to avoid collisions. Similarly, priority planning can be used to allow the agents with higher priority to utilize the fastest path and speed (Ma et al. 2016, Cáp et al. 2013). Even though decoupled approaches can be used for a large number of agents, they are not very effective since they consider a small portion of the joint configuration space and search through low-dimensional spaces (Sanchez and Latombe 2002). Dynamically coupled approaches are proposed to resolve these issues. These approaches lie between coupled and decoupled approaches.
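The priority-planning idea admits a compact sketch: agents are planned one at a time in priority order, and each later agent searches a time-expanded grid in which the (cell, time) pairs reserved by higher-priority agents act as moving obstacles. The sketch below is our minimal illustration, not an implementation from the cited papers: it uses BFS instead of A* and checks vertex conflicts only (swap conflicts between adjacent agents are ignored for brevity).

```python
from collections import deque

def plan_with_reservations(grid, start, goal, reserved, max_t=50):
    # BFS over (cell, time); `reserved` holds (cell, time) pairs claimed
    # by higher-priority agents. grid[r][c] == 1 marks a static obstacle.
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, 0, (start,))])
    seen = {(start, 0)}
    while frontier:
        (r, c), t, path = frontier.popleft()
        if (r, c) == goal:
            return list(path)
        if t >= max_t:
            continue
        for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)):  # 4 moves + wait
            nr, nc = r + dr, c + dc
            nxt = ((nr, nc), t + 1)
            if (0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0
                    and nxt not in reserved and nxt not in seen):
                seen.add(nxt)
                frontier.append(((nr, nc), t + 1, path + ((nr, nc),)))
    return None  # no collision-free path within the horizon

def prioritized_plan(grid, starts, goals, horizon=60):
    # Plan agents in the given priority order; each later agent must avoid
    # the space-time cells already reserved by earlier ones.
    reserved, paths = set(), []
    for start, goal in zip(starts, goals):
        path = plan_with_reservations(grid, start, goal, reserved)
        if path is None:
            return None  # this ordering fails; a different priority order may succeed
        for t, cell in enumerate(path):
            reserved.add((cell, t))
        for t in range(len(path), horizon):  # agent keeps occupying its goal cell
            reserved.add((path[-1], t))
        paths.append(path)
    return paths
```

Note the decoupled character of the method: each agent solves a low-dimensional single-agent search, which is why such planners scale to many agents but may fail for orderings that a coupled search would solve.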
For example, Conflict-Based Search (CBS) finds optimal or semi-optimal paths without searching in high-dimensional spaces, by building a set of constraints and planning for each agent (Barer et al. 2014, Sharon et al. 2015). Allowing on-demand growth of the search space during planning (Wagner and Choset 2015) is another approach. On the other hand, the locations of obstacles may be considered static or dynamic (Smierzchalski and Michalewicz 2005), which results in another two categories of algorithms, off-line and on-line planning, respectively. Sartoretti et al. (2019a) consider the MAPF problem with static obstacles and propose PRIMAL, an IQL-based approach for decentralized MAPF. PRIMAL uses RL and imitation learning (IL) to learn from an expert centralized MAPF planner. With PRIMAL, the agents use RL to learn efficient single-agent path planning, and IL is used to efficiently learn actions that can affect other agents and the whole team's benefit. This eliminates the need for explicit communication among the agents during execution. They consider a discrete grid-world state and define the local state of each agent as all the information in a 10 × 10-cell block centered at the agent's location, along with the unit vector pointing toward the goal. The actions are the four possible moves (when allowed) plus staying still, and the reward is a penalty for each move and a higher penalty for staying still. Also, a collision results in a penalty and reaching the goal yields a large positive reward. An algorithm based on A3C is used to train each agent locally, which may result in selfish behavior, leading to locally optimized actions for each agent. To address this issue, three methods are proposed: (i) a blocking penalty, (ii) imitation learning, and (iii) randomizing the size and obstacle density of the world.
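The 10 × 10 local field of view can be obtained as a simple crop of the map centered at the agent. The sketch below is our simplification of such an observation (PRIMAL itself uses several separate channels for obstacles, other agents, and goals); here, out-of-map cells are padded as obstacles, and the unit goal vector is appended as described above.

```python
import numpy as np

def local_observation(world, pos, goal, fov=10):
    # Crop a fov x fov block centered at the agent, padding out-of-map
    # cells as obstacles, and compute the unit vector toward the goal.
    half = fov // 2
    padded = np.pad(world, half, constant_values=1)  # outside the map = obstacle
    r, c = pos[0] + half, pos[1] + half
    patch = padded[r - half:r + half, c - half:c + half]
    direction = np.array(goal, dtype=float) - np.array(pos, dtype=float)
    norm = np.linalg.norm(direction)
    unit = direction / norm if norm > 0 else direction
    return patch, unit
```

Because the observation size is fixed regardless of map size, the same policy network can be deployed on worlds of arbitrary dimensions, which is what makes the randomized world sizes in item (iii) feasible.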
To use IL, a good heuristic algorithm is used to generate trajectories, which are used along with the RL-generated trajectories to train the model. It is shown that each of the three methods needs to be in the model, and removing any of them results in a large loss in the accuracy of the model. The model is compared with heuristic approaches that have access to the full observation of the environment. They also implemented PRIMAL on a small fleet of autonomous ground vehicles in a factory mockup. Wang et al. (2020a) proposed globally guided reinforcement learning (G2RL), which uses a reward structure that generalizes to arbitrary environments. G2RL first calls a global guidance algorithm to get a path for each agent, and then during the robot's motion a local double deep Q-learning (DDQN) (Van Hasselt et al. 2016) based planner is used to generate actions that avoid the static and dynamic obstacles. The global guidance algorithm observes the whole map and all static obstacles. To make the RL agent capable of performing enough exploration, a dense reward function is proposed which encourages the agent to explore freely and does not force the agent to strictly follow the global guidance. The global motion plan is calculated once and remains the same during the motion. The dynamic obstacles may move in each time step, and their locations become known to an agent when they are within a given distance of the agent. Similar to Sartoretti et al. (2019a), the agent has five possible actions. The goal is to minimize the number of steps that all agents need to go from their start points to their end points. To train the RL agent, each agent's local observation is passed into a transformation function, and the output is passed into a CNN-LSTM-FC network to get the action.
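The double-DQN update behind the local planner replaces the standard max-target of DQN: the online network selects the next action, and the target network evaluates it (Van Hasselt et al. 2016), which reduces overestimation bias. A minimal batched sketch (the array shapes and function names are our assumptions, not from the G2RL paper):

```python
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    # q_online_next, q_target_next: (batch, n_actions) Q-values at s_{t+1}
    # from the online and target networks, respectively.
    best_actions = np.argmax(q_online_next, axis=1)              # selection: online net
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]  # evaluation: target net
    return rewards + gamma * (1.0 - dones) * evaluated           # terminal states bootstrap to 0
```

In training, these targets replace the plain `r + gamma * max_a Q_target(s', a)` term of vanilla DQN; everything else in the update is unchanged.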
In the local observation, the static obstacles, the dynamic obstacles, and the route proposed by the global guidance are depicted with different colors. The agent uses the double-DQN algorithm to obtain its local actions. Since an IQL-based approach is used to train the agent, it can be used in a system with any number of agents. G2RL is compared with several central-controller-type algorithms to demonstrate its performance in different cases.

10.5 Production Systems

In Dittrich and Fohlmeister (2020), a cooperative MARL approach is proposed for the production control problem. The key idea in this paper is to use a decentralized control system to reduce the problem complexity and add the capability of real-time decision making. However, this can cause locally optimal solutions. To overcome this limitation, cooperative behavior is considered for the agents. To implement the idea, a central module that contains a deep Q-learning algorithm is considered. This DQN module, plus a decentralized multi-agent system (MAS), communicates with the manufacturing system (i.e., the environment). The MAS module consists of two types of agents, namely order agents and machine agents. The order agents make scheduling decisions based on the work plan, and the machine agents keep the machines' information. These agents are able to collect some local data, and the order agents have the capability of making some local decisions following the instructions from the DQN module. For each released order i, the subsequent g orders are grouped and denoted by G_i. The state is defined as the data related to the orders, and the action represents the available machine tools for processing particular orders. The reward contains a local reward and a global reward.
The local reward is defined in a way that encourages order agents to choose the fastest route, whereas the global reward takes higher values when the deviation between the actual lead time and the target lead time is smaller. The proposed framework is tested on a job shop problem with three processing steps, and the results are compared to a capacity-based solution. One of the other problems studied with MARL is the overhead hoist transportation (OHT) problem in semiconductor fabrication (FAB) systems. Three main problems exist in FAB systems: (i) The dispatching problem looks for the assignment of idle OHTs to new loads, like moving a bundle of wafers. The goal of this problem is to minimize the average delivery time or to maximize the resource utilization ratio for material handling. In this problem, several constraints like deadlines or job priorities should be considered. (ii) The problem of determining the optimal path of the OHTs moving from a source machine to a destination machine. The goal of this problem is usually to minimize the total travel time or the total tardiness. (iii) The rebalancing problem, which aims to reallocate idle OHTs over different sections of the FAB system. The goal is to minimize the time between the assignment of a load to an OHT and the start of the delivery, which is called the retrieval time. Ahn and Park (2021) consider the third problem and propose a reinforcement learning algorithm based on graph neural networks to minimize the average retrieval time of the idle OHTs. Each agent is considered as the area that contains two machines to perform a process on the semiconductor. Each agent decides whether to move an idle OHT from its zone to a neighboring zone.
The local state of agent i at time t includes (i) the number of idle OHTs at time t in zone i, (ii) the number of working OHTs at time t in zone i, and (iii) the number of loads waiting in zone i at time t. The observation of agent i at time t includes the states s_t^j of all neighbor zones j that share their state with zone i. Then, a graph neural network is used to embed the observation of each agent into a vector of a given size. In the graph, each node represents a zone, and the node feature is the state of that zone. The edge feature of the graph is the number of OHTs moving from one node to another. The policy is a function of the embedded observation, including the states of the neighbor zones. An actor-critic algorithm is proposed, in which the policy parameters are shared among all the agents and the critic model uses the global state of the environment. Several numerical experiments are presented to show the effectiveness of the proposed model compared to some heuristic models.

10.6 Image Classification

In Mousavi et al. (2019) the authors show that decentralized multi-agent reinforcement learning can be used for image classification. In their framework, multiple agents receive partial observations of the environment, communicate with their neighbors on a communication graph, and relocate to update their locally available information. Using an extension of the REINFORCE algorithm, an algorithm is proposed to update the prediction and motion planning modules in an end-to-end manner. Results on the MNIST data-set are provided, in which each agent only observes a few pixels of the image and uses an LSTM network to learn the policy. A similar problem and approach is followed in Mousavi et al. (2019) on the MNIST data-set.

10.7 Stock Market

Bao and Liu (2019) consider the liquidation of a large amount of stock in a limited time T.
The liquidation process usually is done with the cooperation of several traders/brokers and massively impacts the market. The problem becomes more complex when other entities want to liquidate the same stock in the market. This problem can be modeled as a multi-agent system with (i) competitive objectives, when each agent wants to sell its own stock at the highest price, and (ii) cooperative objectives, when several agents want to cooperatively sell the stock of one customer at the highest price. The problem is modeled as a multi-agent system in which each agent has to select to sell a ∈ [0, 1] percent of its stocks in each time step. If the agent selects to sell nothing, it takes the risk of dropped prices, and at the end of the T time periods the trader has to sell all remaining stocks, even at zero price. An adapted version of the DDPG algorithm is proposed to solve this problem. In a related problem, Lee et al. (2007) proposed MQ-Trader, which makes buy and sell suggestions in the stock exchange market. MQ-Trader consists of four cooperative Q-learning agents: the buy and sell signal agents, which determine a binary action for buy/discard or sell/hold, respectively. These agents want to determine the right time to buy or sell stocks. The other agents, the buy and sell order agents, decide about the buy price and sell price, respectively. These agents cooperate to maximize profitability. The intuition behind this system is to effectively divide the complex stock trading problem into simpler sub-problems. So, each agent needs to learn specialized knowledge for itself, i.e., buy/sell and price decisions. The state for the signal agents is represented by a matrix which is filled with a function of long-term price history data. On the other hand, the order agents do not need the long history of the price and use the history of prices within the day to determine the price.
The action for the order agents is to choose a best-price ratio over the moving average of the price, and the reward is given as the profit obtained by following the action. The performance of the algorithm is analyzed on KOSPI 200, which includes 200 major stocks in the Korea stock exchange market, and its results are compared with some existing benchmarks.

10.8 Maintenance Management

As an application in civil engineering, Andriotis and Papakonstantinou (2019) propose a MARL algorithm for efficient maintenance management of structures, e.g., bridges, hospitals, etc. Each structure has m > 1 components, and the maintenance plan has to consider all of them. The goal is to find the optimal maintenance plan for the whole structure, not the optimal policy for separated components. It is assumed that all components observe the global state, and a shared reward is known to all agents. An actor-critic algorithm called Deep Centralized Multi-agent Actor-Critic (DCMAC) is proposed to solve this problem. DCMAC assumes that, given the global state, the actions of different components are conditionally independent. This is how the authors deal with the non-stationarity issue in multi-agent systems. Therefore, a centralized value function is trained. Also, a centralized actor network outputs a set of actions {|A_1|, ..., |A_m|}, one for each component, as well as one set of available actions describing the decisions for the subsystem. This algorithm particularly extends the policy gradient algorithm to cases with a large number of discrete actions. Since the proposed algorithm is in the off-policy setting, an importance sampling technique is applied to deal with this issue. Under particular valid assumptions on engineering systems, the proposed algorithm can be extended to the case of Partially Observable MDPs (POMDPs).
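The conditional-independence assumption means the joint policy factorizes into per-component distributions, all driven by the same global state: pi(a_1, ..., a_m | s) = prod_i pi_i(a_i | s), so the actor's output size grows linearly in m rather than exponentially. A minimal sketch of this factorization with linear logits (DCMAC itself uses deep networks; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def factored_policy(state, weights):
    # One categorical distribution per component, all conditioned on the
    # same global state. `weights` holds one (|A_i| x d) matrix per component.
    return [softmax(W @ state) for W in weights]

def sample_joint_action(state, weights):
    # Sample each component's action independently given the state.
    return [int(rng.choice(len(p), p=p)) for p in factored_policy(state, weights)]
```

With m components of |A_i| actions each, the actor emits sum_i |A_i| logits instead of prod_i |A_i|, which is the extension to large discrete action spaces mentioned above.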
In the numerical experiments, they cover a broad range of engineering systems, including (i) a simple stationary parallel series MDP system, (ii) a non-stationary system with k-out-of-n modes in both MDP and POMDP environments, and (iii) a bridge truss system subject to non-stationary corrosion, simulated through an actual nonlinear structural model, in a POMDP environment. The results demonstrate the effectiveness of the proposed algorithm.

11 Environments

Environments are core elements for training any non-batch MARL algorithm. Basically, the environment provides the scope of the problem, and different problems/environments have been the motivation for developing new RL algorithms. Interacting with the real world is usually expensive, time-consuming, or sometimes impossible. Thus, using a simulator of the environment is the common practice. Using simulators helps to compare different algorithms and indeed provides a framework for comparing them. With these motivations, several single-agent environments have been developed. Among them are Arcade, which provides Atari-2600 games (Bellemare et al. 2013); MuJoCo, which simulates the detailed physics of the movements of human and some animal bodies (Todorov et al. 2012); OpenAI Gym, which gathers these together (Brockman et al. 2016); the PyGame Learning Environment, similar to Arcade (Tasfi 2016); OpenSim, which builds musculoskeletal structures of humans (Seth et al. 2011); DeepMind Lab, for 3D navigation and puzzle-solving (Beattie et al. 2016); ViZDoom, for 3D shooting and navigation in Doom using only visual information (Kempka et al. 2016); Malmo, based on Minecraft (Johnson et al. 2016); MINOS, for home indoor 3D navigation (Savva et al. 2017); House3D, for 3D navigation in indoor areas (Wu et al. 2018); and MazeLab (Zuo 2018), to mention just a few.
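These simulators generally expose a common Gym-style reset/step interface. A minimal toy environment in that style (a hypothetical 1-D corridor of our own, not any of the environments above):

```python
class GridEnv:
    """Minimal Gym-style environment: reset() returns an initial state;
    step(a) returns (next_state, reward, done, info)."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: +1 moves right, -1 moves left along the corridor
        self.state = min(max(self.state + action, 0), self.size - 1)
        reward = 1.0 if self.state == self.size - 1 else -0.1
        done = self.state == self.size - 1
        return self.state, reward, done, {}  # the dict carries extra diagnostics

# a typical rollout loop against this interface
env = GridEnv()
s, done, total = env.reset(), False, 0.0
while not done:
    s, r, done, info = env.step(+1)
    total += r
```

Any policy written against this four-tuple contract runs unchanged on every environment that honors it, which is exactly what makes algorithm comparisons across simulators practical.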
Each of these environments includes at least standard step and reset functions, such that env.reset() returns a random initial state and env.step(a_t) returns (s_{t+1}, r_t, d_t, dict), in which dict contains some additional information. This general structure makes it possible to reproduce a given trajectory with a given policy. In the multi-agent domain, there is a smaller number of available environments. In addition, there is a broad range of possible settings for sharing information among different agents. For example, some environments involve communication actions, some share the joint reward, and some share the global state; each of these cases needs special algorithms, and not all of the algorithms in Sections 5-9 can be applied to each of these problems. Considering these limitations, there is a smaller number of environments in each setting. Among those, StarCraft II (Samvelyan et al. 2019) has attracted a lot of attention in recent years (Foerster et al. 2017, Usunier et al. 2017, Singh et al. 2018, Foerster et al. 2018, Rashid et al. 2018, Peng et al. 2017). In this game, each agent only observes its own local information and receives a global reward, though different versions of the game with a globally observable state are also available. Finding dynamic goals has also been a common benchmark for both discrete and continuous action spaces. The multi-agent particle environment (Mordatch and Abbeel 2018a) gathers a list of navigation tasks, e.g., predator-prey, for both discrete and continuous actions. Some of the games allow choosing a communication action too. Harvest-gathering (Jaques et al. 2019) is a similar game with a communication action. Neural MMO (Suarez et al.
2019) provides an MMORPG (Massively Multiplayer Online Role-Playing Game) environment, like Pokemon, in which agents learn combat and navigation while there is a large population of the same agents with the same goal. In the area of traffic management, Zhang et al. (2019a) provided CityFlow, a new traffic-signal environment to be used for MARL research. In addition, Wu et al. (2017) introduced a framework to control cars within a mixed system of human-like drivers and AI agents. In the same direction as those real-world-like environments, Lussange et al. (2021) introduce a simulator of the stock market for multi-agent systems. In addition to the mentioned environments which are proposed by different papers, a few projects have gathered some of the environments together to provide a framework similar to OpenAI Gym for multi-agent systems. Jiang (2019) provides 12 navigation/maze-like game environments. Unity (Juliani et al. 2018) is a platform to develop single/multi-agent games, which can be simple grid-worlds or quite complex strategic games with multiple agents. The resulting game can be used as an environment for training machine learning agents. The framework supports cooperative and competitive multi-agent environments. Unity gives the ability to create any kind of multi-agent environment that is intended, although it is not specifically designed for multi-agent systems. Arena (Song et al. 2020) extends the Unity engine and provides a specific platform to define and build new multi-agent games and scenarios based on the available games. The framework includes 38 multi-agent games, of which 27 are new games. New scenarios or games can be built on top of these available games.
Designing the games involves a GUI-based configurable social tree, and the reward function can be selected from five reward schemes that are proposed to cover most of the possible cases in competitive/cooperative games. Similarly, the Ray platform (Moritz et al. 2018) also recently started to support multi-agent environments, and some of the known algorithms have been added to the RLlib repository (Liang et al. 2017, 2018). Ray supports agents entering/leaving the problem, which is a common case in traffic control suites.

12 Potential Research Directions

Off-policy MARL: In RL, off-policy refers to learning about a policy (called the target policy) while the learning agent is behaving under a different policy (called the behavior policy). This method is of great interest because it can learn about an optimal policy while it explores, and it can also learn about multiple policies while only following a single policy. While the former advantage is helpful in both single- and multi-agent settings, the latter one seems to be a better fit for the multi-agent setting. In the MARL literature, there exist a few algorithms utilizing off-policy learning (Zhang et al. 2018b, Suttle et al. 2020, Macua et al. 2015); however, there is still room for research on off-policy MARL in algorithm design, theory, and applications.

Safe MARL: Safe RL is defined as the training of an agent to learn a policy that maximizes the long-term cumulative discounted reward while ensuring reasonable performance, in order to respect safety constraints and avoid catastrophic situations during the training as well as the execution of the policy. The main approaches in safe RL are based on introducing the risk concept into the optimality conditions and regulating the exploration process to avoid undesirable actions. There is numerous research on safe RL for single-agent RL; see a comprehensive review in García and Fernández (2015).
Nonetheless, the research on safe MARL is very scarce. For example, in Shalev-Shwartz et al. (2016), a safe RL algorithm is proposed for multi-agent autonomous driving. Similarly, in Diddigi et al. (2019) a framework for constrained cooperative multi-agent games is proposed, in which a constrained optimization method is utilized to ensure safety. To solve the problem, a Lagrangian relaxation along with an actor-critic algorithm is proposed. Given the current limited research on this topic, another straightforward research direction for MARL would be safe MARL, in order to provide more applicable policies in this setup.

Heterogeneous MARL: Most of the works we studied above are homogeneous MARL, meaning that all the agents in the network are identical in terms of ability and skill. However, in real-world applications, we likely face multi-agent problems where agents have different skills and abilities. Therefore, an additional problem here would be how different agents should utilize the other agents' abilities to learn a more efficient policy. As a special case, consider human-machine interaction. Particularly, humans are able to solve some RL problems very quickly using their experience and cognitive abilities. For example, in a small 2D space, they can find a very good approximation of the shortest path very quickly, no matter how complex the search space is. On the other hand, machines have the ability to solve more complex problems in high-dimensional spaces. However, optimality comes at the cost of computational complexity, so that oftentimes only a feasible solution is possible. The question that needs to be answered in this problem is the following: Is it possible to develop MARL algorithms that combine heterogeneous agents' abilities toward maximizing the long-term gain? Moreover, can this be done in a principled way that comes with performance guarantees?
Optimization in MARL: Without a doubt, optimization is an indispensable part of RL problems. Any progress in optimization methods may lead to more efficient RL, and in turn MARL, algorithms. In recent years, there has been a flurry of research on designing optimization algorithms for solving complex problems, including nonconvex and nonsmooth optimization problems for multi-agent and distributed systems (Bianchi and Jakubowicz 2012, Di Lorenzo and Scutari 2016, Hong et al. 2017). However, the MARL literature still lacks those algorithms. Future research directions on MARL from the optimization perspective can be divided into two main branches. First, applying the existing optimization algorithms (or adapting them when necessary) to multi-agent problems. For instance, TRPO (Schulman et al. 2015), which has been shown to be very efficient in single-agent RL problems, might be helpful for multi-agent problems as well. Second, focusing on the theory of the algorithms. Despite the decent performance of the numerical methods which utilize neural networks in MARL, there exists a huge gap between such numerical performance and any kind of convergence analysis. Therefore, this might be the time to think out of the box and focus on the theory of the neural networks too.

Inverse MARL: One of the most vital components of RL is reward specification. While in some problems such as games it is trivial, in many other applications pre-specifying the reward function is a cumbersome procedure and may lead to poor results. In such circumstances, modeling a skillful agent's behavior is utilized for ascertaining the reward function. This is called Inverse Reinforcement Learning. While this area has attained significant attention in single-agent RL problems (Arora and Doshi 2021), there is no remarkable contribution regarding inverse RL for MARL.
Therefore, a potential research avenue in MARL would be inverse MARL: how to define the relevant components, address the possible challenges, and extend it to the potential applications.

Model-based MARL: Despite the numerous success stories of model-free RL and MARL, a very typical limitation of these algorithms is sample efficiency. Indeed, these algorithms require a tremendous number of samples to reach good performance. On the other hand, model-based RL has been shown to be very successful in a great range of applications (Moerland et al. 2020). In this type of RL algorithm, first the environment model is learned, and then this model is utilized for prediction and control. In single-agent RL, there exists a significant amount of research regarding model-based RL methods; see for instance Sun et al. (2019); however, their extension to MARL has not been explored widely. Therefore, investigating model-based MARL is another worthwhile research direction.

13 Conclusion

In this review, we categorized MARL algorithms into five groups, namely independent learners, fully observable critic, value function decomposition, consensus, and learn to communicate. Then we provided an overview of the most recent papers in these classes. For each paper, first, we highlighted the problem setup, such as the availability of the global state, global action, reward, and the communication pattern among the agents. Then, we presented the key idea and the main steps of the proposed algorithm. Finally, we listed the environments which have been used for evaluating the performance of the algorithm. In addition, among the broad range of applications of MARL to real-world problems, we picked a few representative ones and showed how MARL can be utilized for solving such complicated problems. In Table 3 a summary of the most influential papers in each category is presented.
In this summary, we have gathered the general settings of the considered problem and the proposed algorithm to show the gaps and the possible research directions. For example, the table shows that with value decomposition (VD), there is not any research that considers local states, local actions, and local policies. In this table, the third column, Com, shows the communication status, i.e., 0 means there is no communication among the agents, and 1 otherwise. In the fourth column, AC means the proposed algorithm is actor-critic based, and Q is used when the proposed algorithm is value-based. In the fifth column, Conv stands for convergence; here, 1 is for the case that a convergence analysis is provided, and otherwise it is 0. In the last three columns, the tuple (Trn, Exe) stands for (Training, Execution), and G and L are for global and local availability, respectively, and determine whether the state, action, and policy of each agent is known to the other agents.

References

Monireh Abdoos, Nasser Mozayani, and Ana LC Bazzan. Traffic light control in non-stationary environments based on multi agent q-learning. In 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1580-1585, Oct 2011. doi: 10.1109/ITSC.2011.6083114.

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, pages 64-66. PMLR, 2020.

Adrian K Agogino and Kagan Tumer. Unifying temporal and structural credit assignment problems. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 980-987. IEEE Computer Society, 2004.

Kyuree Ahn and Jinkyoo Park.
Cooperative zone-based rebalancing of idle overhead hoist transportations using multi-agent reinforcement learning with graph representation learning. IISE Transactions, 0(0):1-17, 2021. doi: 10.1080/24725854.2020.1851823. URL https://doi.org/10.1080/24725854.2020.1851823.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39-48, 2016.

CP Andriotis and KG Papakonstantinou. Managing engineering systems with large state and action spaces through deep reinforcement learning. Reliability Engineering & System Safety, 191:106483, 2019.

Reference                        Com  ComLim  AC/Q  Conv  State      Action     Reward
                                                          (Trn,Exe)  (Trn,Exe)  (Trn,Exe)
IQL
  Tan (1993)                     0    0       Q     0     (L,L)      (L,L)      (G,G)
  Lauer and Riedmiller (2000)    0    0       Q     0     (G,G)      (L,L)      (G,G)
  Matignon et al. (2007)         0    0       Q     0     (G,G)      (G,L)      (G,G)
  Tampuu et al. (2017)           0    0       Q     0     (G,G)      (L,L)      (G,G)
  Omidshafiei et al. (2017)      0    0       Q     0     (G,G)      (G,L)      (G,G)
  Fuji et al. (2018)             0    0       Q     0     (G,G)      (G,L)      (G,G)
Fully Observable Critic
  Wang et al. (2019)             1    1       AC    0     (G,L)      (G,L)      (L,L)
  Foerster et al. (2018)         0    0       AC    0     (G,L)      (G,L)      (G,G)
  Ryu et al. (2018)              0    0       AC    0     (G,L)      (G,L)      (L/G,L/G)
  Sartoretti et al. (2019b)      0    0       AC    0     (G,G)      (L,L)      (L,L)
  Chu and Ye (2017)              0    0       AC    0     (L,L)      (L,L)      (L/G,L/G)
  Yang et al. (2020)             0    0       AC    0     (L,L)      (L,L)      (L,L)
  Jiang et al. (2020)            1    0       Q     0     (L,L)      (L,L)      (L,L)
  Iqbal and Sha (2019)           1    0       AC    0     (G,L)      (G,L)      (G,L)
  Yang et al. (2018a)            1    0       AC/Q  1     (G,G)      (G,L)      (G,L)
  Kim et al. (2019)              1    1       AC    0     (G,L)      (G,L)      (G,G)
  Lowe et al. (2017)             0    0       AC    0     (G,L)      (G,L)      (L,L)
  Mao et al. (2019)              0    0       AC    0     (G,L)      (G,L)      (L,L)
VD
  Rashid et al. (2018)           0    0       Q     0     (G,L)      (G,L)      (G,G)
  Sunehag et al. (2018)          0    0       Q     0     (L,L)      (L,L)      (G,G)
  Mguni et al. (2018)            0    0       Q     1     (L,L)      (L,L)      (G,G)
  Son et al. (2019)              0    0       Q     1     (L,L)      (L,L)      (G,G)
Consensus
  Zhang et al. (2018c)           1    0       AC    1     (G,G)      (G,L)      (L,L)
  Kar et al. (2013a)             1    0       Q     1     (G,G)      (L,L)      (L,L)
  Lee et al. (2018)              1    0       Q     1     (G,G)      (L,L)      (L,L)
  Macua et al. (2015)            1    0       Q     0     (L,L)      (L,L)      (G,G)
  Macua et al. (2018)            1    0       AC    0     (L,L)      (L,L)      (L,L)
  Cassano et al. (2021)          1    0       Q     1     (L/G,L/G)  (L,L)      (L/G,L/G)
  Zhang et al. (2018b)           1    0       AC    1     (G,G)      (G,L)      (L,L)
  Zhang and Zavlanos (2019)      1    0       AC    1     (G,G)      (G,G)      (L,L)
Learn to comm
  Varshavskaya et al. (2009)     1    0       AC    1     (L,L)      (L,L)      (L,L)
  Peng et al. (2017)             1    0       AC    0     (G,G)      (G,G)      (L,L)
  Foerster et al. (2016)         1    0       Q     0     (L,L)      (L,L)      (G,G)
  Sukhbaatar et al. (2016)       1    0       AC    0     (L,L)      (L,L)      (G,G)
  Singh et al. (2018)            1    0       AC    0     (L,L)      (L,L)      (L,L)
  Lazaridou et al. (2017)        1    0       AC    0     (G,G)      (L,L)      (G,G)
  Das et al. (2017)              1    0       AC    0     (L,L)      (G,G)      (L,L)

Table 3: The proposed algorithms for MARL and the relevant settings. AC stands for all actor-critic and policy-gradient-based algorithms, and Q represents any value-based algorithm. Com stands for communication: Com = 1 means the agents communicate directly, and Com = 0 means otherwise. ComLim stands for a communication bandwidth limit: ComLim = 1 means there is a limit on the bandwidth, and ComLim = 0 means otherwise. Conv stands for convergence: Conv = 1 means there is a convergence analysis for the proposed method, and Conv = 0 means otherwise. The tuple (Trn, Exe) shows the way that the state, reward, or action is shared in (training, execution); e.g., (G, L) under State means that during training the state is observable globally and during execution it is only accessible locally to each agent.

Itamar Arel, Cong Liu, Tom Urbanik, and Airton G Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128-135, 2010.

Saurabh Arora and Prashant Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress.
Artificial Intelligence, page 103500, 2021.

Kenneth Joseph Arrow, Leonid Hurwicz, and Hirofumi Uzawa. Studies in Linear and Non-linear Programming. Stanford University Press, 1958.

Wenhang Bao and Xiao-yang Liu. Multi-agent deep reinforcement learning for liquidation strategy analysis. In Workshops at the Thirty-Sixth ICML Conference on AI in Finance, 2019.

Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020.

Max Barer, Guni Sharon, Roni Stern, and Ariel Felner. Suboptimal variants of the conflict-based search algorithm for the multi-agent pathfinding problem. In Seventh Annual Symposium on Combinatorial Search. Citeseer, 2014.

Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957.

Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation.
In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.

Pascal Bianchi and Jérémie Jakubowicz. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control, 58(2):391–405, 2012.

Vivek S Borkar and Sean P Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250, 2002.

Marc Brittain and Peng Wei. Autonomous air traffic controller: A deep multi-agent reinforcement learning approach. In Reinforcement Learning for Real Life Workshop in the 36th International Conference on Machine Learning, Long Beach, 2019.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Lucian Bu, Robert Babu, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.

Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1, pages 183–221. Springer, 2010.

Michal Cáp, Peter Novák, Martin Selecký, Jan Faigl, and Jiří Vokřínek. Asynchronous decentralized prioritized planning for coordination in multi-robot system. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3822–3829. IEEE, 2013.

L. Cassano, K. Yuan, and A. H. Sayed. Multi-agent fully decentralized value function learning with linear convergence rates. IEEE Transactions on Automatic Control, 66(4):1497–1512, 2021. doi: 10.1109/TAC.2020.2995814.
Chacha Chen, Hua Wei, Nan Xu, Guanjie Zheng, Ming Yang, Yuanhao Xiong, Kai Xu, and Zhenhui Li. Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Jianshu Chen and Ali H Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.

Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P How. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 285–292. IEEE, 2017.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724–1734, 2014. URL http://aclweb.org/anthology/D/D14/D14-1179.pdf.

Jinyoung Choi, Beom-Jin Lee, and Byoung-Tak Zhang. Multi-focus attention network for efficient deep reinforcement learning. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Tianshu Chu, Jie Wang, Lara Codecà, and Zhaojian Li. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21(3):1086–1095, 2019.

Xiangxiang Chu and Hangjun Ye. Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1710.00336, 2017.

Kamil Ciosek and Shimon Whiteson. Expected policy gradients for reinforcement learning. Journal of Machine Learning Research, 21(52):1–51, 2020. URL http://jmlr.org/papers/v21/18-012.html.

Rongxin Cui, Bo Gao, and Ji Guo.
Pareto-optimal coordination of multiple robots with safety guarantees. Autonomous Robots, 32(3):189–205, 2012.

Felipe Leno Da Silva and Anna Helena Reali Costa. A survey on transfer learning for multiagent reinforcement learning systems. Journal of Artificial Intelligence Research, 64:645–703, 2019.

Nikolay Dandanov, Hussein Al-Shatri, Anja Klein, and Vladimir Poulkov. Dynamic self-optimization of the antenna tilt for best trade-off between coverage and capacity in mobile networks. Wireless Personal Communications, 92(1):251–278, 2017.

Abhishek Das, Satwik Kottur, José M F Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2951–2960, 2017.

Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Mike Rabbat, and Joelle Pineau. TarMAC: Targeted multi-agent communication. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1538–1546, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/das19a.html.

Sam Devlin and Daniel Kudenko. Theoretical considerations of potential-based reward shaping for multi-agent systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 225–232. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. Potential-based difference rewards for multiagent reinforcement learning. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 165–172. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.

Raghuram Bharadwaj Diddigi, Sai Koti Reddy Danda, Shalabh Bhatnagar, et al. Actor-critic algorithms for constrained multi-agent reinforcement learning. arXiv preprint arXiv:1905.02907, 2019.

Marc-André Dittrich and Silas Fohlmeister. Cooperative multi-agent system for production control using reinforcement learning. CIRP Annals, 69(1):389–392, 2020.

Adam Eck, Leen-Kiat Soh, Sam Devlin, and Daniel Kudenko. Potential-based reward shaping for finite horizon online POMDP planning. Autonomous Agents and Multi-Agent Systems, 30(3):403–445, 2016.

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pages 3061–3071. PMLR, 2020.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1146–1155. JMLR.org, 2017.

Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Benjamin Freed, Guillaume Sartoretti, Jiaheng Hu, and Howie Choset. Communication learning via backpropagation in discrete channels with unknown noise.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7160–7168, 2020.

Taiki Fuji, Kiyoto Ito, Kohsei Matsumoto, and Kazuo Yano. Deep multi-agent reinforcement learning using DNN-weight evolution to optimize supply chain performance. In Proceedings of the 51st Hawaii International Conference on System Sciences, page 8, 2018.

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.

Thomas Gabel and Martin Riedmiller. On a successful application of multi-agent reinforcement learning to operations research benchmarks. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 68–75. IEEE, 2007.

Qitong Gao, Davood Hajinezhad, Yan Zhang, Yiannis Kantaros, and Michael M Zavlanos. Reduced variance deep reinforcement learning with temporal logic specifications. In Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, pages 237–248. ACM, 2019.

Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

Mevludin Glavic, Raphaël Fonteneau, and Damien Ernst. Reinforcement learning for electric power system decision and control: Past considerations and perspectives. IFAC-PapersOnLine, 50(1):6918–6927, 2017.

Yaobang Gong, Mohamed Abdel-Aty, Qing Cai, and Md Sharikur Rahman. Decentralized network level adaptive signal control by multi-agent deep reinforcement learning. Transportation Research Interdisciplinary Perspectives, 1:100020, 2019.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor.
In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.

Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.

Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.

Magnus Rudolph Hestenes, Eduard Stiefel, et al. Methods of Conjugate Gradients for Solving Linear Systems, volume 49. NBS Washington, DC, 1952.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Chris HolmesParker, Matthew E. Taylor, Yusen Zhan, and Kagan Tumer. Exploiting structure and agent-centric rewards to promote coordination in large multiagent systems. In Adaptive and Learning Agents Workshop, 2014.

Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1529–1538. JMLR.org, 2017.

Yedid Hoshen. VAIN: Attentional multi-agent predictive modeling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2701–2711. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6863-vain-attentional-multi-agent-predictive-modeling.pdf.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning.
In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2961–2970, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/iqbal19a.html.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2016.

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3040–3049, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/jaques19a.html.

Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, pages 7254–7264, 2018.

Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. Graph convolutional reinforcement learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HkxdQkSYDB.

Shuo Jiang. Multi-Agent Reinforcement Learning Environment. https://github.com/Bigpig4396/Multi-Agent-Reinforcement-Learning-Environment, 2019. Accessed: 2019-07-28.

Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The Malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247, 2016.

Emilio Jorge, Mikael Kågebäck, Fredrik D Johansson, and Emil Gustavsson. Learning to play Guess Who? and inventing a grounded language as a consequence. arXiv preprint arXiv:1611.03218, 2016.
Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627, 2018.

Soummya Kar, José MF Moura, and H Vincent Poor. QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing, 61(7):1848–1862, 2013a.

Soummya Kar, José MF Moura, and H Vincent Poor. Distributed reinforcement learning in multi-agent networks. In 2013 5th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 296–299. IEEE, 2013b.

Tatsuya Kasai, Hiroshi Tenmoto, and Akimoto Kamiya. Learning of communication codes in multi-agent reinforcement learning problem. In 2008 IEEE Conference on Soft Computing in Industrial Applications, pages 1–6. IEEE, 2008.

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, pages 341–348, Santorini, Greece, Sep 2016. IEEE. The best paper award.

Daewoo Kim, Sangwoo Moon, David Hostallero, Wan Ju Kang, Taeyoung Lee, Kyunghwan Son, and Yung Yi. Learning to schedule communication in multi-agent reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJxu5iR9KQ.

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

Dirk P Kroese and Reuven Y Rubinstein. Monte Carlo methods. Wiley Interdisciplinary Reviews: Computational Statistics, 4(1):48–58, 2012.

Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum.
Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.

Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.

Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.

Steven M LaValle. Planning Algorithms. Cambridge University Press, 2006.

Alessandro Lazaric. Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pages 143–173. Springer, 2012.

Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. In Advances in Neural Information Processing Systems, pages 833–840, 2008.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. In ICLR, 2017.

Donghwan Lee, Hyung-Jin Yoon, and Naira Hovakimyan. Primal-dual algorithm for distributed reinforcement learning: Distributed GTD. 2018 IEEE Conference on Decision and Control (CDC), pages 1967–1972, 2018.

Jae Won Lee, Jonghun Park, O Jangmin, Jongwoo Lee, and Euyseok Hong. A multiagent approach to Q-learning for daily stock trading. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 37(6):864–877, 2007.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas.
In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

Stephane Leroy, Jean-Paul Laumond, and Thierry Siméon. Multiple path coordination for mobile robots: A geometric algorithm. In IJCAI, volume 99, pages 1118–1123, 1999.

Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg, and Ion Stoica. Ray RLlib: A composable and scalable reinforcement learning library. In Deep Reinforcement Learning Symposium (DeepRL @ NeurIPS), 2017.

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3053–3062. PMLR, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/liang18b.html.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016.

Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. Efficient large-scale fleet management via multi-agent deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1774–1783. ACM, 2018.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Zachary C Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, and Li Deng.
Efficient exploration for dialog policy learning with deep BBQ networks & replay buffer spiking. CoRR abs/1608.05081, 2016.

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/227e072d131ba77451d8f27ab9afdfb7-Paper.pdf.

Ruishan Liu and James Zou. The effects of memory replay in reinforcement learning. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 478–485. IEEE, 2018.

Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380–385. IEEE, 2017.

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6382–6393, 2017.

Johann Lussange, Ivan Lazarevich, Sacha Bourgeois-Gironde, Stefano Palminteri, and Boris Gutkin. Modelling stock markets by multi-agent reinforcement learning. Computational Economics, 57(1):113–147, 2021.

Hang Ma, Craig Tovey, Guni Sharon, TK Kumar, and Sven Koenig. Multi-agent path finding with payload transfers and the package-exchange robot-routing problem. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Hang Ma, Daniel Harabor, Peter J Stuckey, Jiaoyang Li, and Sven Koenig. Searching with consistent prioritization for multi-agent path finding.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7643–7650, 2019.

Sergio Valcarcel Macua, Jianshu Chen, Santiago Zazo, and Ali H Sayed. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, May 2015. ISSN 0018-9286. doi: 10.1109/TAC.2014.2368731.

Sergio Valcarcel Macua, Aleksi Tukiainen, Daniel García-Ocaña Hernández, David Baldazo, Enrique Munoz de Cote, and Santiago Zazo. Diff-DAC: Distributed actor-critic for average multitask deep reinforcement learning. In Adaptive Learning Agents (ALA) Conference, 2018.

Rajbala Makar, Sridhar Mahadevan, and Mohammad Ghavamzadeh. Hierarchical multi-agent reinforcement learning. In Proceedings of the Fifth International Conference on Autonomous Agents, pages 246–253. ACM, 2001.

Hangyu Mao, Zhengchao Zhang, Zhen Xiao, and Zhibo Gong. Modelling the dynamic joint policy of teammates with attention multi-agent DDPG. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 1108–1116. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

Laëtitia Matignon, Guillaume Laurent, and Nadine Le Fort-Piat. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS '07, pages 64–69, 2007.

Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.

David Mguni, Joel Jennings, Sergio Valcarcel Macua, Sofia Ceppi, and Enrique Munoz de Cote. Controlling the crowd: Inducing efficient equilibria in multi-agent systems.
In Advances in Neural Information Processing Systems 2018 MLITS Workshop, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. Advances in Neural Information Processing Systems, Deep Learning Workshop, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Model-based reinforcement learning: A survey. arXiv preprint arXiv:2006.16712, 2020.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018a.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.

H. K. Mousavi, M. Nazari, M. Takáč, and N. Motee. Multi-agent image classification via reinforcement learning.
In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5020–5027, 2019. doi: 10.1109/IROS40897.2019.8968129.

Hossein K. Mousavi, Guangyi Liu, Weihang Yuan, Martin Takác, Héctor Muñoz-Avila, and Nader Motee. A layered architecture for active perception: Image classification using deep reinforcement learning. CoRR, abs/1909.09705, 2019. URL http://arxiv.org/abs/1909.09705.

Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pages 9839–9849, 2018.

Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 278–287, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2. URL http://dl.acm.org/citation.cfm?id=645528.657613.

Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 50(9):3826–3839, 2020.

Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2681–2690. JMLR.org, 2017.

Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence V. Snyder, and Martin Takáč. A deep Q-network for the beer game: Deep reinforcement learning for inventory optimization. Manufacturing & Service Operations Management, 0(0):null, 0. doi: 10.1287/msom.2020.0939. URL https://doi.org/10.1287/msom.2020.0939.
Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, and Longbo Huang. Reinforcement learning with dynamic Boltzmann softmax updates. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 1992–1998. International Joint Conferences on Artificial Intelligence Organization, 7 2020. doi: 10.24963/ijcai.2020/276. URL https://doi.org/10.24963/ijcai.2020/276. Main track.

Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. 2017.

Paris Pennesi and Ioannis Ch Paschalidis. A distributed actor-critic algorithm and applications to mobile sensor network coordination problems. IEEE Transactions on Automatic Control, 55(2):492–497, Feb 2010. ISSN 0018-9286. doi: 10.1109/TAC.2009.2037462.

Kirstin Petersen. TERMES: An autonomous robotic system for three-dimensional collective construction. Robotics: Science and Systems VII, page 257, 2012.

KJ Prabuchandran, Hemanth Kumar AN, and Shalabh Bhatnagar. Multi-agent reinforcement learning for traffic signal control. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 2529–2534. IEEE, 2014.

LA Prashanth and Shalabh Bhatnagar. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2):412–421, 2010.

Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017.

Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind.
In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4218–4227, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/rabinowitz18a.html.

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4295–4304, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/rashid18a.html.

Heechang Ryu, Hayong Shin, and Jinkyoo Park. Multi-agent actor-critic with generative cooperative policy network. arXiv preprint arXiv:1810.09206, 2018.

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019.

Gildardo Sanchez and J-C Latombe. Using a prm planner to compare centralized and decoupled planning for multi-robot systems. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), volume 2, pages 2112–2119. IEEE, 2002.

Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, TK Satish Kumar, Sven Koenig, and Howie Choset. Primal: Pathfinding via reinforcement and imitation multi-agent learning. IEEE Robotics and Automation Letters, 4(3):2378–2385, 2019a.

Guillaume Sartoretti, Yue Wu, William Paivine, TK Satish Kumar, Sven Koenig, and Howie Choset.
Distributed reinforcement learning for multi-robot decentralized collective construction. In Distributed Autonomous Robotic Systems, pages 35–49. Springer, 2019b.

Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931, 2017.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR (Poster), 2016.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

Christian Schroeder de Witt, Jakob Foerster, Gregory Farquhar, Philip Torr, Wendelin Boehmer, and Shimon Whiteson. Multi-agent common knowledge reinforcement learning. Advances in Neural Information Processing Systems, 32:9927–9939, 2019.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Ajay Seth, Michael Sherman, Jeffrey A Reinbolt, and Scott L Delp. Opensim: a musculoskeletal modeling and simulation framework for in silico investigations and exchange. Procedia Iutam, 2:212–232, 2011.

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.

Guni Sharon, Roni Stern, Ariel Felner, and Nathan R Sturtevant. Conflict-based search for optimal multi-agent pathfinding. Artificial Intelligence, 219:40–66, 2015.

Tianmin Shu and Yuandong Tian. M3RL: Mind-aware multi-agent management reinforcement learning.
In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkzeUiRcY7.

Maria Amélia Lopes Silva, Sérgio Ricardo de Souza, Marcone Jamilson Freitas Souza, and Ana Lúcia C Bazzan. A reinforcement learning-based multi-agent framework applied for solving routing and scheduling problems. Expert Systems with Applications, 131:148–171, 2019.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 387–395, Beijing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/silver14.html.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale in multiagent cooperative and competitive tasks. In ICLR, 2018.

Roman Smierzchalski and Zbigniew Michalewicz. Path planning in dynamic environments. In Innovations in Robot Mobility and Control, pages 135–153. Springer, 2005.

Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi.
QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2019.

Yuhang Song, Andrzej Wojcicki, Thomas Lukasiewicz, Jianyi Wang, Abi Aryan, Zhenghua Xu, Mai Xu, Zihan Ding, and Lianlong Wu. Arena: A general evaluation platform and building toolkit for multi-agent intelligence. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7253–7260, Apr. 2020. doi: 10.1609/aaai.v34i05.6216. URL https://ojs.aaai.org/index.php/AAAI/article/view/6216.

Miloš S Stanković and Srdjan S Stanković. Multi-agent temporal-difference learning with linear function approximation: Weak convergence under time-varying network topologies. In 2016 American Control Conference (ACC), pages 167–172, July 2016. doi: 10.1109/ACC.2016.7524910.

Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000.

Joseph Suarez, Yilun Du, Phillip Isola, and Igor Mordatch. Neural mmo: A massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784, 2019.

Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, and Rob Fergus. Mazebase: A sandbox for learning from games. arXiv preprint arXiv:1511.07401, 2015.

Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.

Gita Sukthankar and Juan A Rodriguez-Aguilar. Autonomous Agents and Multiagent Systems: AAMAS 2017 Workshops, Best Papers, São Paulo, Brazil, May 8-12, 2017, Revised Selected Papers, volume 10642. Springer, 2017.
Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2085–2087. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Wesley Suttle, Zhuoran Yang, Kaiqing Zhang, Zhaoran Wang, Tamer Başar, and Ji Liu. A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. IFAC-PapersOnLine, 53(2):1549–1554, 2020. ISSN 2405-8963. doi: https://doi.org/10.1016/j.ifacol.2020.12.2021. URL https://www.sciencedirect.com/science/article/pii/S2405896320326562. 21st IFAC World Congress.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 993–1000, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553501. URL http://doi.acm.org/10.1145/1553374.1553501.
Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.

Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.

Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.

Hongyao Tang, Jianye Hao, Tangjie Lv, Yingfeng Chen, Zongzhang Zhang, Hangtian Jia, Chunxu Ren, Yan Zheng, Changjie Fan, and Li Wang. Hierarchical deep multiagent reinforcement learning. arXiv preprint arXiv:1809.09332, 2018.

Norman Tasfi. Pygame learning environment. https://github.com/ntasfi/PyGame-Learning-Environment, 2016.

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies for starcraft micromanagement. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=r1LXit5ee.

Jur Van Den Berg, Stephen J Guy, Ming Lin, and Dinesh Manocha. Reciprocal n-body collision avoidance. In Robotics Research, pages 3–19. Springer, 2011.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Harm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning.
In Advances in Neural Information Processing Systems, pages 5392–5402, 2017.

Paulina Varshavskaya, Leslie Pack Kaelbling, and Daniela Rus. Efficient distributed reinforcement learning through agreement. In Distributed Autonomous Robotic Systems 8, pages 367–378. Springer, 2009.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3540–3549. JMLR.org, 2017.

Glenn Wagner and Howie Choset. Subdimensional expansion for multirobot path planning. Artificial Intelligence, 219:1–24, 2015.

Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, and Mingyi Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems, pages 9649–9660, 2018.

Binyu Wang, Zhe Liu, Qingbiao Li, and Amanda Prorok. Mobile robot path planning in dynamic environments through globally guided reinforcement learning. IEEE Robotics and Automation Letters, 5(4):6932–6939, 2020a.

Hongbing Wang, Xiaojun Wang, Xingguo Hu, Xingzhi Zhang, and Mingzhu Gu. A multi-agent reinforcement learning approach to dynamic service composition. Information Sciences, 363:96–119, 2016a.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=BJgQfkSYDS.

Rose E Wang, Michael Everett, and Jonathan P How.
R-maddpg for partially observable environments and limited communication. In Reinforcement Learning for Real Life Workshop in the 36th International Conference on Machine Learning, Long Beach, 2019.

Shiyong Wang, Jiafu Wan, Daqiang Zhang, Di Li, and Chunhua Zhang. Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Computer Networks, 101:158–168, 2016b.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1995–2003, New York, New York, USA, 20–22 Jun 2016c. PMLR. URL http://proceedings.mlr.press/v48/wangf16.html.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Hua Wei, Chacha Chen, Guanjie Zheng, Kan Wu, Vikash Gayah, Kai Xu, and Zhenhui Li. Presslight: Learning max pressure control to coordinate traffic signals in arterial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 1290–1298, 2019a.

Hua Wei, Nan Xu, Huichu Zhang, Guanjie Zheng, Xinshi Zang, Chacha Chen, Weinan Zhang, Yanmin Zhu, Kai Xu, and Zhenhui Li. Colight: Learning network-level cooperation for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1913–1922, 2019b.

Hua Wei, Guanjie Zheng, Vikash Gayah, and Zhenhui Li. A survey on traffic signal control methods. arXiv preprint arXiv:1904.08117, 2019c.

Gerhard Weiß. Distributed reinforcement learning.
In Luc Steels, editor, The Biology and Technology of Intelligent Autonomous Agents, pages 415–428, Berlin, Heidelberg, 1995. Springer Berlin Heidelberg.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Cathy Wu, Aboudy Kreidieh, Kanaad Parvate, Eugene Vinitsky, and Alexandre M Bayen. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465, 2017.

Jun Wu and Xin Xu. Decentralised grid scheduling approach based on multi-agent reinforcement learning and gossip mechanism. CAAI Transactions on Intelligence Technology, 3(1):8–17, 2018.

Jun Wu, Xin Xu, Pengcheng Zhang, and Chunming Liu. A novel multi-agent reinforcement learning approach for job scheduling in grid computing. Future Generation Computer Systems, 27(5):430–439, 2011.

Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209, 2018. URL https://openreview.net/forum?id=rkaT3zWCZ.

Ian Xiao. A distributed reinforcement learning solution with knowledge transfer capability for a bike rebalancing problem. arXiv preprint arXiv:1810.04058, 2018.

Jiachen Yang, Alireza Nakhaei, David Isele, Kikuo Fujimura, and Hongyuan Zha. Cm3: Cooperative multi-goal multi-stage multi-agent reinforcement learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1lEX04tPr.

Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.

Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning.
In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5571–5580, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018a. PMLR.

Zhuoran Yang, Kaiqing Zhang, Mingyi Hong, and Tamer Başar. A finite sample analysis of the actor-critic algorithm. In 2018 IEEE Conference on Decision and Control (CDC), pages 2759–2764. IEEE, 2018b.

Dayong Ye, Minjie Zhang, and Yun Yang. A multi-agent framework for packet routing in wireless sensor networks. Sensors, 15(5):10026–10047, 2015.

Bicheng Ying, Kun Yuan, and Ali H Sayed. Convergence of variance-reduced learning under random reshuffling. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2286–2290. IEEE, 2018.

Huizhen Yu. On convergence of emphatic temporal-difference learning. In Conference on Learning Theory, pages 1724–1751, 2015.

Erik Zawadzki, Asher Lipson, and Kevin Leyton-Brown. Empirically evaluating multiagent learning algorithms. arXiv preprint arXiv:1401.8074, 2014.

Chengwei Zhang, Xiaohong Li, Jianye Hao, Siqi Chen, Karl Tuyls, and Zhiyong Feng. Scc-rfmq learning in cooperative markov games with continuous actions. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2162–2164. International Foundation for Autonomous Agents and Multiagent Systems, 2018a.

Chongjie Zhang, Victor Lesser, and Prashant Shenoy. A multi-agent learning approach to online distributed resource allocation. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.

Huaguang Zhang, He Jiang, Yanhong Luo, and Geyang Xiao. Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method.
IEEE Transactions on Industrial Electronics, 64(5):4091–4100, May 2017. ISSN 0278-0046. doi: 10.1109/TIE.2016.2542134.

Huichu Zhang, Siyuan Feng, Chang Liu, Yaoyao Ding, Yichen Zhu, Zihan Zhou, Weinan Zhang, Yong Yu, Haiming Jin, and Zhenhui Li. Cityflow: A multi-agent reinforcement learning environment for large scale city traffic scenario. In The World Wide Web Conference, pages 3620–3624. ACM, 2019a.

Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Networked multi-agent reinforcement learning in continuous spaces. In 2018 IEEE Conference on Decision and Control (CDC), pages 2771–2776. IEEE, 2018b.

Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5872–5881, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018c. PMLR.

Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019b.

Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020a.

Ke Zhang, Fang He, Zhengchao Zhang, Xi Lin, and Meng Li. Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 121:102861, 2020b.

Shangtong Zhang and Richard S Sutton. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.

Y. Zhang and M. M. Zavlanos. Distributed off-policy actor-critic reinforcement learning with policy consensus.
In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 4674–4679, 2019. doi: 10.1109/CDC40024.2019.9029969.

Guanjie Zheng, Yuanhao Xiong, Xinshi Zang, Jie Feng, Hua Wei, Huichu Zhang, Yong Li, Kai Xu, and Zhenhui Li. Learning phase competition for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1963–1972, 2019.

Xingdong Zuo. mazelab: A customizable framework to create maze and gridworld environments. https://github.com/zuoxingdong/mazelab, 2018.
