Learning Multi-Agent Coordination through Connectivity-driven Communication


Authors: Emanuele Pesce, Giovanni Montana

Springer Nature 2021 LaTeX template

Emanuele Pesce² and Giovanni Montana¹,²,³*

¹Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK.
²WMG, University of Warwick, Coventry, CV4 7AL, UK.
³Alan Turing Institute, London, NW1 2DB, UK.

*Corresponding author(s). E-mail(s): g.montana@warwick.ac.uk;
Contributing authors: e.pesce@warwick.ac.uk;

Abstract

In artificial multi-agent systems, the ability to learn collaborative policies is predicated upon the agents' communication skills: they must be able to encode the information received from the environment and learn how to share it with other agents as required by the task at hand. We present a deep reinforcement learning approach, Connectivity-driven Communication (CDC), that facilitates the emergence of multi-agent collaborative behaviour only through experience. The agents are modelled as nodes of a weighted graph whose state-dependent edges encode pair-wise messages that can be exchanged. We introduce a graph-dependent attention mechanism that controls how the agents' incoming messages are weighted. This mechanism takes into full account the current state of the system as represented by the graph, and builds upon a diffusion process that captures how the information flows on the graph. The graph topology is not assumed to be known a priori, but depends dynamically on the agents' observations, and is learnt concurrently with the attention mechanism and policy in an end-to-end fashion. Our empirical results show that CDC is able to learn effective collaborative policies and can outperform competing learning algorithms on cooperative navigation tasks.
Keywords: Reinforcement Learning, Multi-agent systems, Neural Networks, Graphs

1 Introduction

In reinforcement learning (RL), an agent learns to take sequential decisions by mapping its observations of the world to actions, using a reward as feedback signal [1]. In the last few years, deep artificial neural networks [2, 3] have been leveraged to improve the learning ability of RL algorithms in a number of ways, e.g. as policy function approximators to map observations to actions and to learn informative data representations. The resulting deep reinforcement learning (DRL) algorithms have recently achieved unprecedented performance in single-agent tasks, e.g. in playing Go [4] and Atari games [5, 6].

Multi-agent reinforcement learning (MARL) extends RL to problems characterized by the interplay of multiple agents operating in a shared environment. This is a scenario that is typical of many real-world applications, including robot navigation [7], autonomous vehicle coordination [8], traffic management [9], and supply chain management [10]. Compared to single-agent systems, MARL presents additional layers of complexity. When multiple learners interact with each other, the environment becomes highly non-stationary from the point of view of each individual actor [11]. Moreover, credit assignment [12], which is the ability to determine how the actions of each individual agent impact the overall system performance, becomes particularly difficult [13–15]. We are interested in systems involving agents that autonomously learn how to collaborate in order to achieve a shared outcome.
When multiple agents are expected to develop a cooperative behaviour, an important need emerges: an adequate communication protocol must be established to support the level of coordination that is necessary to solve the task. The fact that communication plays a critical role in achieving synchronization in multi-agent systems has been extensively documented [16–23]. Building upon this evidence, a number of multi-agent DRL (MADRL) algorithms have lately been developed which try to facilitate the spontaneous emergence of communication strategies during training. In particular, significant efforts have gone into the development of attention mechanisms for filtering out irrelevant information [24–30] (see also Section 2).

In this paper we introduce a MADRL algorithm for cooperative multi-agent tasks. Our approach relies on learning a state-dependent communication graph whose topology controls what information should be exchanged within the system and how this information should be distributed across agents. As such, the communication graph plays a dual role. First, it represents how every pair of agents jointly encodes their observations to form local messages to be shared with others. Secondly, it controls a mechanism by which local messages are propagated through the network to form agent-specific information content that is ultimately used to make decisions. As we will demonstrate, this approach supports the emergence of a collaborative decision-making policy. The core idea we intend to exploit is that, given any particular state of the environment, the graph topology should be self-adapting to support the most efficient information flow. This raises the question: how should efficiency be measured?
Our proposed approach, connectivity-driven communication (CDC), is inspired by the process of heat transfer on a graph, and specifically by the heat kernel (HK). The HK describes the effect of applying a heat source to a network and observing the diffusion process over time. As such, it can be used to characterise the way in which information flows across nodes. The HK has been used in a number of different application domains where there is a need to characterise the topology of a graph, e.g. in 3D object recognition [31] and neuroimaging [32, 33]. Various metrics obtained from the HK have been used to organise the intrinsic geometry of a network over multiple scales by capturing local and global shapes in relation to a node via a time parameter. The HK also incorporates a concept of node influence, as measured by heat propagation in a network, which can be exploited to characterise how efficiently information propagates between any pair of nodes. To the best of our knowledge, this is the first time that the HK has been used to develop an end-to-end learnable attention mechanism enabling multi-agent cooperation.

Our approach relies on an actor-critic paradigm [34–36] and is intended to extend the centralized-learning with decentralized-execution (CLDE) framework [20, 37]. In CDC, the observations of all agents are assumed known only during the training phase, whilst during execution each agent makes autonomous decisions using only its own information. The entire model is learned end-to-end, supported by the fact that the heat kernel is a differentiable operator allowing the gradients to flow throughout the architecture. The performance of CDC has been evaluated against alternative methods on four cooperative navigation tasks. Our experimental evidence demonstrates that CDC is capable of outperforming other relevant state-of-the-art algorithms.
In addition, we analyse the communication patterns discovered by the agents to illustrate how interpretable topological structures can emerge in different scenarios.

The structure of this work is as follows. In Section 2 we discuss related state-of-the-art MADRL methods, focusing on cooperative systems with communication mechanisms. In Section 3 we provide the details of the proposed CDC algorithm. Experimental results are then provided in Section 4. Finally, in Section 5, we discuss the benefits and potential limitations of the proposed methodology with a view on further improvements in future work.

2 Related Work

Multi-agent systems have been widely studied in a number of different domains, such as machine learning [38], game theory [39] and distributed systems [40]. Recent advances in deep reinforcement learning have enabled multi-agent systems capable of autonomous decision-making [41–43], improving on tabular-based solutions [44]. In this section, we briefly review recent developments in MADRL with a focus on the communication strategies that have been proposed to improve cooperation.

2.1 Centralised learning with decentralised execution

When multiple learners interact with each other, the environment becomes non-stationary from the perspective of individual agents, which results in increased training instability [45, 46]. An approach that has proved particularly effective consists of training the agents assuming centralised access to the entire system's information, whilst executing the policies in a decentralised manner (CLDE) [20, 23, 29, 37, 47, 48]. During training, a critic module has access to information related to other agents, i.e. their actions and observations.
MADDPG [37], for example, extends DDPG [35] in this fashion: each agent has a centralised critic providing feedback to the actors, which decide what actions to take. A variant of this approach has recently been proposed to deal with partially observable environments through the use of recurrent neural networks [49, 50]. In [48], a centralised critic is used to estimate the Q-function whilst decentralised actors optimise the agents' policies. In [51], an action-value critic network coordinates decentralised policy networks for a fleet management problem.

2.2 Communication methods

Communication has always played a crucial role in facilitating synchronization and coordination [52–56]. Some of the recent MADRL approaches facilitate the emergence of novel communication protocols through communication mechanisms. For example, in CommNet [21], the hidden states of the agents' neural networks are first averaged and then used jointly with each agent's own observations to decide what action to take. Similarly, in [57], communication is enabled by connecting agents' policies through a bidirectional recurrent neural network that can produce higher-level information to be shared. In IC3Net [22], a gating mechanism decides whether to allow or block access to other agents' hidden states.

Other approaches have introduced explicit communication mechanisms that can be learnt from experience. For instance, in RIAL [20], each agent learns a simple encoding that is transferred over a differentiable channel and allows the gradient of the Q-function to flow; this enables an agent's feedback to take into account the exchanged information. In our previous work [23], the agents are equipped with a memory device allowing them to write and read signals to be shared within the system.
The communication mechanism we propose in this paper is also explicit; messages are signals that must be shared within the system in order to maximize the shared rewards, and serve no other purpose.

2.3 Attention mechanisms to support communication

In a collaborative decision-making context, attention mechanisms are used to selectively identify relevant information, coming from the environment and other agents, that should be prioritised to infer better policies. For example, in [24], the agents first encode their observations to produce messages; then an attention unit, implemented as a recurrent neural network (RNN), probabilistically controls which incoming messages are used as inputs for the action selection network. The CommNet algorithm [21] has been extended using a multi-agent predictive modeling approach [27] which captures the locality of interactions and improves performance by determining which agents will share information. In the IS algorithm [58], the agents predict their future trajectories, and these predictions are utilised by an attention mechanism module to compose a message determining the next actions to take. The TarMAC algorithm [28] instead leverages the signature-based attention model originally proposed in [59]. Here, each agent receives the messages broadcast by others and produces a query that helps select what information to keep and what to discard. The latter approach is closely related to the work proposed in this paper; our agents also aggregate information coming from different sources in order to maximise their final reward.

2.4 Diffusion processes on graphs

Spectral graph theory allows one to relate the properties of a graph to its spectrum by analysing its associated eigenvectors and eigenvalues [60–62].
The heat kernel falls into this category; it is a powerful and well-studied operator that allows one to study certain properties of a graph by solving the heat diffusion equation. The HK is determined by exponentiating the graph's Laplacian eigensystem [63] over time. The resulting features can be used to study the graph's topology and have been utilised across different applications whereby graphs are naturally occurring data structures; e.g. the HK has been used for community detection [64], data manifold extraction [65], network classification [33] and image smoothing [31], amongst others. In recent work, the HK has been adopted to extend graph convolutional networks [66] and to define edge structures supporting convolutional operators [67]. In this work, we use the HK to characterise the state-dependent topology of a multi-agent communication network and learn how the information should flow within the network.

2.5 Graph-based communication mechanisms

Graph structures provide a natural framework for modelling interactions in RL domains [68–70]. Lately, Graph Neural Networks (GNNs) have also been adopted to learn useful graph representations in cooperative multi-agent systems [71–75]. For example, graphs have been used to model spatio-temporal dependencies within episodes for traffic light control [76], and to infer a multi-agent connectivity structure which, once processed by a GNN, generates the features required to decide what action to take [77–79]. Heterogeneous graph attention networks [80] have been introduced to learn efficient and diverse communication models for coordinating heterogeneous agents.
Graph convolutional networks capturing multi-agent interactions have also been combined with a counterfactual policy gradient algorithm to deal with the credit assignment problem [81].

GNNs have also supported the development of multi-stage attention mechanisms. For instance, [26] describe a two-stage approach whereby multi-agent interactions are first determined, and their importance is then estimated to generate actions. In GraphComm [82], the agents share their encoded observations over a multi-step communication process; at each step a GNN processes a graph and generates signals for the subsequent communication round. This multi-round process is designed to increase the range of the communication mechanism and favour a longer-range exchange of information. The MAGIC algorithm [83] consists of a scheduler, which learns when to communicate and whom to address messages to, and a message processor, which processes communication signals; both components are implemented using GNNs and the entire architecture is learned end-to-end.

In our proposed model, the attention mechanism depends on how the encoded information exchanged amongst the agents flows within the graph; the graph topology itself depends on the encoded observations, and the heat kernel is used as a topology-dependent feature to control the agents' communication. The processes of encoding the observations, inferring the graph topology, and learning the attention mechanism are all coupled, with the aim of learning an optimal policy.

3 Connectivity-driven Communication

3.1 Problem setting

We consider Markov Games, a partially observable extension of Markov decision processes [84] involving $N$ interacting agents. We use $\mathcal{S}$ to denote the set of environmental states; $\mathcal{O}_i$ and $\mathcal{A}_i$ indicate the sets of all possible observations and actions for the $i$th agent, with $i \in 1, \ldots, N$, respectively.
The agent-specific (private) observation at time $t$ is denoted by $o_i^t \in \mathcal{O}_i$, and each action $a_i^t \in \mathcal{A}_i$ is deterministically determined by a mapping $\mu_{\theta_i}: \mathcal{O}_i \mapsto \mathcal{A}_i$, which is parametrised by $\theta_i$. A transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_N \mapsto \mathcal{S}$ describes the stochastic behaviour of the environment. Each agent receives a reward, defined as a function of states and actions, $r_i: \mathcal{S} \times \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_N \mapsto \mathbb{R}$, and learns a policy that maximises the expected discounted future reward over a period of $T$ time steps, $J(\theta_i) = \mathbb{E}[R_i]$, where $R_i = \sum_{t=0}^{T} \gamma^t r_i^t(s^t, a_1^t, \ldots, a_N^t)$ is the discounted sum of future rewards and $\gamma \in [0,1]$ is the discount factor.

3.2 Learning the dynamic communication graph

We model each agent as a node of a time-dependent, undirected (and unknown) weighted graph, $\mathcal{G}^t = (\mathcal{V}, S^t)$, where $\mathcal{V}$ is a set of $N$ nodes and $S^t$ is an $N \times N$ matrix of edge weights. Each $S^t(u,v) = S^t(v,u) = s_{u,v}^t$ quantifies the degree of communication, or connectivity strength, between a given pair of agents, $u$ and $v$. Specifically, we assume that each $s_{u,v}^t \in [0,1]$, with values close to 1 indicating strong connectivity and values close to 0 a lack of connectivity. In our formulation, each $s_{u,v}^t$ is not known a priori.

Fig. 1: Diagrammatic representation of CDC at a fixed time-step. Agents' observations are encoded to generate a graph topology (blue box on the left). The diffusion process is used to quantify global information flow throughout the graph and to control the communication process (blue box on the right). In this example, line thickness is proportional to communication strength. At training time, observations and actions are utilised by the critic to provide feedback on the graph components.
Instead, each of these connectivities is assumed to be a time-dependent parameter that varies as a function of the current state of the environment. This is done through the following two-step process. First, given a pair of agents, $u$ and $v$, their private observations at time-step $t$ are encoded to form a local message,

$c_{u,v}^t = c_{v,u}^t = \phi_{\theta_c}(o_u^t, o_v^t)$   (1)

where $\phi_{\theta_c}$ is a non-linear mapping modelled as a neural network with parameters $\theta_c$. Each local message is then encoded non-linearly to produce the corresponding connectivity weight,

$s_{u,v}^t = s_{v,u}^t = \sigma(\phi_{\theta_s}(c_{u,v}^t))$   (2)

where $\phi_{\theta_s}$ is a neural network parameterised by $\theta_s$ and $\sigma$ is the sigmoid function.

3.3 Learning a time-dependent attention mechanism

Once the time-dependent connectivities in Eq. (2) are estimated, the communication graph $\mathcal{G}^t$ is fully specified. Given this graph, our aim is to characterise the relative contribution of each node to the overall flow of information over the entire network, and to let these contributions define an attention mechanism controlling what messages are being exchanged. The resulting attention mechanism should be differentiable with respect to the network parameters to ensure that, during backpropagation, all the gradients correctly flow throughout the architecture to enable end-to-end training.

Our observation is that a diffusion process over graphs can be deployed to quantify how the information flows across all agents for any given communication graph, $\mathcal{G}^t$. The information flowing process is conceptualised as the amount of energy that propagates throughout the network [85]. Specifically, we deploy the heat diffusion process: we mimic the process of applying a source of heat over a network and observe how it varies as a function of time.
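Before turning to the diffusion process, the two-step graph construction of Eqs. (1)–(2) can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the encoder architectures, the order-invariant pairing of the two observations (used so that $c_{u,v}^t = c_{v,u}^t$ holds by construction), and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, OBS_DIM, MSG_DIM, HID = 4, 8, 16, 32

# Hypothetical random weights standing in for the trainable encoders
# phi_theta_c (pairwise message encoder) and phi_theta_s (connectivity encoder).
W1 = rng.normal(0.0, 0.1, (2 * OBS_DIM, HID))
b1 = np.zeros(HID)
W2 = rng.normal(0.0, 0.1, (HID, MSG_DIM))
b2 = np.zeros(MSG_DIM)
w_s = rng.normal(0.0, 0.1, MSG_DIM)
b_s = 0.0

def phi_c(o_u, o_v):
    # Eq. (1): encode the private observations of agents u and v into a
    # local message. The symmetric pairing (sum and absolute difference)
    # is an assumption made so that c[u, v] == c[v, u] by construction.
    x = np.concatenate([o_u + o_v, np.abs(o_u - o_v)])
    return np.tanh(x @ W1 + b1) @ W2 + b2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

obs = rng.normal(size=(N, OBS_DIM))   # private observations o_i^t
C = np.zeros((N, N, MSG_DIM))         # local messages c_{u,v}^t
S = np.zeros((N, N))                  # connectivity weights s_{u,v}^t
for u in range(N):
    for v in range(u + 1, N):
        C[u, v] = C[v, u] = phi_c(obs[u], obs[v])
        # Eq. (2): squash each message into a scalar weight in [0, 1].
        S[u, v] = S[v, u] = sigmoid(C[u, v] @ w_s + b_s)
```

By construction $S$ is symmetric with entries in $[0,1]$, matching the assumptions on $s_{u,v}^t$ above.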
In our context, the heat transfer patterns reflect how efficiently the information propagates at time $t$. First, we introduce a diagonal matrix $D$ of dimension $N \times N$ with diagonal elements given by

$D(u,u) = \sum_{v \in \mathcal{V}} s_{u,v}, \quad \forall u \in \mathcal{V}.$

Each such element provides a measure of the strength of node $u$. The Laplacian of the communication graph $\mathcal{G}$ is given by $L = D - S$, and its normalised version is defined as $\hat{L} = D^{-1/2} L D^{-1/2}$. The differential equation describing the heat diffusion process over time $p$ [60, 86] is defined as

$\frac{\partial H(p)}{\partial p} = -\hat{L}\, H(p)$   (3)

where $H(p)$ is the fundamental solution representing the energy flowing through the network at time $p$. To avoid confusion, the environment time-step is denoted by $t$ whilst $p$ indicates the time variable related to the diffusion process. For each pair of nodes $u$ and $v$, the corresponding heat kernel entry is given by

$H(p)_{u,v} = \big(\Phi \exp[-\Lambda p] \Phi^\top\big)_{u,v} = \sum_{i=1}^{N} \exp[-\lambda_i p]\, \phi_i(u)\, \phi_i(v)$   (4)

where $H(p)_{u,v}$ quantifies the amount of heat that started in $u$ and reached $v$ at time $p$, $\phi_i$ represents the $i$th eigenvector, $\Phi = (\phi_1, \ldots, \phi_N)$ is a matrix with the eigenvectors as columns, and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ is a diagonal matrix formed by the eigenvalues of $\hat{L}$ ordered by increasing magnitude. In practice, Eq. (4) is approximated using the Padé approximant [87], $H(p) = \exp[-p\hat{L}]$.

A useful property of $H(p)$ is that it is differentiable with respect to the neural network parameters that define the Laplacian. This allows us to train an architecture where all the relevant quantities are estimated end-to-end via backpropagation. Additional details are provided in Section 3.4.
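The heat kernel of Eq. (4) can be computed directly from the eigendecomposition of the normalised Laplacian. The sketch below is a minimal NumPy illustration, not the paper's implementation (which uses the Padé approximant); the toy connectivity matrix standing in for $S^t$ is an assumption.

```python
import numpy as np

def heat_kernel(S, p):
    """Heat kernel H(p) = exp(-p * L_hat) of a weighted graph with
    connectivity matrix S, via the spectral form of Eq. (4)."""
    d = S.sum(axis=1)                         # node strengths D(u, u)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # assumes every node has d > 0
    L_hat = D_inv_sqrt @ (np.diag(d) - S) @ D_inv_sqrt   # normalised Laplacian
    lam, Phi = np.linalg.eigh(L_hat)          # symmetric matrix -> eigh
    # Phi diag(exp(-lam * p)) Phi^T, i.e. Eq. (4).
    return (Phi * np.exp(-lam * p)) @ Phi.T

# Toy symmetric connectivity matrix standing in for S^t (an assumption;
# in CDC these weights would come from the learned encoders).
S = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.4],
              [0.1, 0.4, 0.0]])

H = heat_kernel(S, p=1.0)
```

Since $H(p)$ solves Eq. (3), it satisfies $H(0) = I$ and the semigroup property $H(p_1 + p_2) = H(p_1)H(p_2)$, both of which make convenient sanity checks for an implementation.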
We leverage this information to develop an attention mechanism that identifies the most important messages within the system, given the current graph topology. First, for every pair of nodes, we identify the critical time point $\hat{p}$ at which the relative change in heat transfer drops below a pre-determined threshold $\delta$ and the kernel value becomes stable, i.e. for each pair $u$ and $v$ we identify the critical value $\hat{p}(u,v)$ such that

$\left| \frac{H^t(p+1)_{u,v} - H^t(p)_{u,v}}{H^t(p)_{u,v}} \right| < \delta.$   (5)

In practice, the search for these critical values is carried out over a uniform grid of points. Once these critical time points are identified, we use them to evaluate the HK values and arrange them into an $N \times N$ matrix, $H^t_{u,v} = H^t(\hat{p}(u,v))_{u,v}$, which is used to define a multi-agent message-passing mechanism. Specifically, the final information content (or message) for an agent $u$ is determined by a linear combination of the local messages received from all other agents,

$m_u^t = \sum_{v \in \mathcal{V}} H^t_{u,v}\, c_{u,v}^t$   (6)

where the HK values are used to weight the importance of the incoming messages. Finally, the agent's action depends deterministically on its message,

$a_u^t = \phi_{\theta_{p_u}}(m_u^t)$   (7)

where $\phi_{\theta_{p_u}}$ is a neural network with parameters $\theta_{p_u}$. A lack of communication between a pair of agents results when no stable HK value can be found: in such cases, for a pair of agents $(u,v)$, no value of $\hat{p}(u,v)$ satisfies Eq. (5) and the corresponding entry $H^t_{u,v}$ is set to zero.

3.4 Heat kernel: additional details and an illustration

The heat kernel is a technique from spectral geometry [63], and is a fundamental solution of the heat equation:

$\frac{\partial H^t(p)}{\partial p} = -\hat{L}^t H^t(p).$   (8)

Given a graph $\mathcal{G}$ defined on $n$ vertices, the normalised Laplacian $\hat{L}$, acting on functions with Neumann boundary conditions [88], is associated with the rate of heat dissipation. $\hat{L}$ can be written as

$\hat{L} = \sum_{i=0}^{n-1} \lambda_i I_i$

where $I_i$ is the projection onto the $i$th eigenfunction $\phi_i$. For a given time $p \geq 0$, the heat kernel $H(p)$ is defined as an $n \times n$ matrix:

$H(p) = \sum_i \exp[-\lambda_i p]\, I_i = \exp[-p\hat{L}].$   (9)

Eq. (9) represents an analytical solution to Eq. (8). Furthermore, the heat kernel $H(p)$ for a graph $\mathcal{G}$ with eigenfunctions $\phi_i$ satisfies

$H(p)_{u,v} = \sum_i \exp[-\lambda_i p]\, \phi_i(u)\, \phi_i(v).$

The proof follows from the fact that

$H(p) = \sum_i \exp[-\lambda_i p]\, I_i \quad \text{and} \quad I_i(u,v) = \phi_i(u)\, \phi_i(v).$

In this work the heat kernel is used to introduce a mechanism for the selection of important edges in a network to support communication between nodes. In this context, the importance of an edge is determined by both its weight and the role it plays in allowing agents to exchange information correctly within the network structure. Figure 2 illustrates the advantages of selecting edges through the heat kernel features over a naive thresholding approach. The heat diffusion considers the edge weights as well as their relevance within the graph structure, e.g. an edge connecting two communities.

Fig. 2: An illustration of two edge selection methods (panels (a), (b) and (c)). Starting from graph (a), we want to remove the less relevant edges. The relevance of an edge is measured considering both its weight and its structural role in allowing information to pass through the network.
The edge connecting nodes 0 and 5, despite its relatively low weight (0.3), has an important structural role: it serves as a bridge connecting the two communities, hence allowing the information to propagate throughout the entire network. In (b), removing the edges with smaller weights (e.g. all those falling below the 40th percentile of the edge weight distribution) results in the loss of the bridge. In (c), edges are selected based on the heat kernel weights, which recognise the importance of the bridge.

3.5 Reinforcement learning algorithm

In this section, we describe how the reinforcement learning algorithm is trained in an end-to-end fashion. We extend the actor-critic framework [34], in which an actor produces actions and a critic provides feedback on the actor's moves. In our architecture, multiple actors, one per agent, receive feedback from a single, centralised critic.

In the standard DDPG algorithm [35, 36], the actor $\mu_\theta: \mathcal{O} \mapsto \mathcal{A}$ and the critic $Q_{\mu_\theta}: \mathcal{O} \times \mathcal{A} \mapsto \mathbb{R}$ are parametrised by neural networks with the aim of maximizing the expected return,

$J(\theta) = \mathbb{E}\Big[\sum_{t=1}^{T} r(o^t, a^t)\Big]$

where $\theta$ is the set of parameters that characterise the return. The gradient $\nabla_\theta J(\theta)$ required to update the parameter vector $\theta$ is calculated as follows,

$\nabla_\theta J(\theta) = \mathbb{E}_{o^t \sim \mathcal{D}}\big[\nabla_\theta \mu_\theta(o^t)\, \nabla_{a^t} Q_{\mu_\theta}(o^t, a^t)\big|_{a^t = \mu_\theta(o^t)}\big],$

whilst $Q_{\mu_\theta}$ is obtained by minimizing the following loss,

$L(\theta) = \mathbb{E}_{o^t, a^t, r^t, o^{t+1} \sim \mathcal{D}}\big[\big(Q_{\mu_\theta}(o^t, a^t) - y\big)^2\big], \qquad y = r^t + \gamma Q'_{\mu_\theta}(o^{t+1}, a^{t+1}).$

Here, $Q'_{\mu_\theta}$ is a target critic whose parameters are only periodically updated with the parameters of $Q_{\mu_\theta}$; this is utilised to stabilize the training. Our developments follow the CLDE paradigm [20, 37, 47].
The critic is employed during learning, but only the actor and communication modules are used at test time. At training time, a centralised critic uses the observations and actions of all the agents to produce the $Q$ values. In order to make the critic unique for all the agents and keep the number of parameters constant, we approximate our $Q$ function with a recurrent neural network (RNN). We treat the observation/action pairs as a sequence,

$z_i^t = \mathrm{RNN}(o_i^t, a_i^t \mid z_{i-1}^t)$   (10)

where $z_i^t$ and $z_{i-1}^t$ are the hidden states produced for the $i$th and $(i-1)$th agents, respectively. Once the observation and action pairs of all $N$ agents are available, we use the last hidden state $z_N^t$ to produce the $Q$-value:

$Q(o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t) = \phi_{\theta_Q}(z_N^t)$

where $\phi_{\theta_Q}$ is a neural network with parameters $\theta_Q$. The parameters of the $i$th agent are adjusted to maximize the objective function $J(\theta_i) = \mathbb{E}[R_i]$ by following the direction of the gradient $\nabla_{\theta_i} J(\theta_i)$,

$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{o_i^t, a_i^t, r^t, o_i^{t+1} \sim \mathcal{D}}\big[\nabla_{\theta_i} \mu_{\theta_i}(m_i^t)\, \nabla_{a_i^t} Q(x)\big|_{a_i^t = \mu_{\theta_i}(m_i^t)}\big]$   (11)

where $x = (o_1^t, \ldots, o_N^t, a_1^t, \ldots, a_N^t)$ and $Q$ minimizes the temporal difference error, i.e.

$L(\theta_i) = \mathbb{E}_{o_i^t, a_i^t, r^t, o_i^{t+1} \sim \mathcal{D}}\big[(Q(x) - y)^2\big], \qquad y = r_i^t + \gamma Q(o_1^{t+1}, \ldots, o_N^{t+1}, a_1^{t+1}, \ldots, a_N^{t+1}).$

The differentiability of the heat kernel operator allows the gradient in Eq. (11) to be evaluated. Since the actions are modelled by a neural network parametrised by $\theta_u$ in Eq. (7), we have that $\nabla_{\theta_u} \mu_{\theta_u}(m_u^t) = \nabla_{\theta_u} \phi_{\theta_u}(m_u^t)$,
From Eq. (6) the gradient is

$$\frac{\partial \phi(m_u^t)}{\partial \theta_u} = \frac{\partial \phi\left(\sum_{v \in V} H_{u,v}^t c_{u,v}^t\right)}{\partial \theta_u} = \sum_{v \in V} \frac{\partial \phi(H_{u,v}^t c_{u,v}^t)}{\partial \theta_u} = \sum_{v \in V} \left[ \frac{\partial \phi(H_{u,v}^t)}{\partial \theta_u}\, c_{u,v}^t + H_{u,v}^t\, \frac{\partial \phi(c_{u,v}^t)}{\partial \theta_u} \right],$$

whilst the gradients of the heat kernel values are

$$\frac{\partial \phi(H_{u,v}^t)}{\partial \theta_u} = \frac{\partial \phi(H_{u,v}^t(\hat{p}))}{\partial \theta_u} = \frac{\partial \exp[-\hat{p}\hat{L}]}{\partial \theta_u} = \frac{\partial \exp\!\left[-\hat{p}\, D^{-1/2} L\, D^{-1/2}\right]}{\partial \theta_u} = \frac{\partial \exp\!\left[-\hat{p}\, D^{-1/2} (D - S)\, D^{-1/2}\right]}{\partial \theta_u},$$

which is a composition of differentiable operations. Algorithm 1 summarises the learning algorithm; the proposed architecture is presented in Figure 1.

Algorithm 1 CDC
 1: Initialise actor networks (μ_θ1, ..., μ_θN) and critic networks (Q_θ1, ..., Q_θN)
 2: Initialise actor target networks (μ′_θ1, ..., μ′_θN) and critic target networks (Q′_θ1, ..., Q′_θN)
 3: Initialise replay buffer D
 4: for episode = 1 to E do
 5:   Reset environment, o¹ = (o¹_1, ..., o¹_N)
 6:   for t = 1 to T do
 7:     Generate Cᵗ (Eq. 1) and Sᵗ (Eq. 2)
 8:     for p = 1 to P do
 9:       Compute heat kernel H(p)ᵗ (Eq. 3)
10:     end for
11:     Build Hᵗ with stable heat kernel values (Eq. 5)
12:     for agent i = 1 to N do
13:       Produce agent's message mᵗ_i (Eq. 6)
14:       Select action aᵗ_i = μ_θi(mᵗ_i)
15:     end for
16:     Execute aᵗ = (aᵗ_1, ..., aᵗ_N), observe r and oᵗ⁺¹
17:     Store transition (oᵗ, aᵗ, r, oᵗ⁺¹) in D
18:   end for
19:   for agent i = 1 to N do
20:     Sample minibatch Θ of B transitions (o, a, r, o′)
21:     Update the critic by minimising L(θ_i) = (1/B) Σ_{(o,a,r,o′)∈Θ} (y − Q(o, a))², where y = r_i + γ Q(o′, a′)|_{a′_k = μ′_θk(m′_k)}, in which m′_k is the global message computed using the target networks
22:     Update the actor according to the policy gradient: ∇_θi J ≈ (1/B) Σ [∇_θi μ_θi(m_i) ∇_{a_i} Q(o, a)|_{a_i = μ_θi(m_i)}]
23:   end for
24:   Update target networks: θ′_i = τ θ_i + (1 − τ) θ′_i
25: end for

4 Experimental results

4.1 Environments

The performance of CDC has been assessed in four different environments. Three of them are commonly used swarm-robotics benchmarks: Navigation Control, Formation Control and Line Control [89–91]. A fourth one, Dynamic Pack Control, has been added to study a more challenging task. All the environments have been tested using the Multi-Agent Particle Environment [37, 92], which allows agents to move around in two-dimensional spaces with discretised action spaces.

In Navigation Control there are N agents and N fixed landmarks. The agents must move closer to all landmarks whilst avoiding collisions. Landmarks are not assigned to particular agents, and the agents are rewarded for minimising the distances between their positions and the landmarks' positions. Each agent can observe the positions of all the landmarks and other agents. In Formation Control there are N agents and only one landmark. In this scenario, the agents must navigate so as to form a polygonal shape, whose vertices are defined by the N agents, centred around the landmark. The agents' objective is to minimise the distances between their locations and the positions required to form the expected shape. Each agent can observe the landmark only. Line Control is very similar to Formation Control, with the difference that the agents must position themselves along the straight line connecting the two landmarks. Finally, in Dynamic Pack Control there are N agents, of which two are leaders and N − 2 are members, and one landmark.
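Steps 7–11 of Algorithm 1 revolve around the heat kernel $H(p) = \exp(-p\hat{L})$ with $\hat{L} = D^{-1/2}(D - S)D^{-1/2}$. The pure-Python sketch below evaluates it with a truncated matrix-exponential series; it is illustrative only (the actual implementation must be a differentiable operator, e.g. in PyTorch) and assumes a symmetric affinity matrix S with zero diagonal and strictly positive degrees:

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=30):
    # exp(M) via the truncated Taylor series sum_k M^k / k!
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def heat_kernel(S, p):
    # normalised Laplacian: L_hat = D^{-1/2} (D - S) D^{-1/2}
    #                             = I - D^{-1/2} S D^{-1/2}  (zero diagonal in S)
    n = len(S)
    deg = [sum(row) for row in S]
    L = [[(1.0 if i == j else 0.0) - S[i][j] / math.sqrt(deg[i] * deg[j])
          for j in range(n)] for i in range(n)]
    return mat_exp([[-p * L[i][j] for j in range(n)] for i in range(n)])
```

At p = 0 the kernel reduces to the identity (no diffusion), and it stays symmetric for any p, matching the interpretation of H as pairwise diffused heat.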
The objective of this task is to simulate a pack behaviour, where the agents have to navigate to reach the landmark. Once a landmark is occupied, it moves to a different location. The landmark location is accessible only to the leaders, while the members are blind, i.e. they can only see their own current location. Typical agent configurations arising from each environment are reported in Figure 3.

Fig. 3: Typical agent configurations for all our environments: (a) Navigation Control N = 3, (b) Navigation Control N = 10, (c) Formation Control N = 4, (d) Formation Control N = 10, (e) Line Control N = 4, (f) Line Control N = 10, (g) Pack Control N = 4, (h) Pack Control N = 8.

For each environment we have tested two versions with a different number of agents: a basic one, focusing on solving the designed task with 3–4 agents, and a scalable one, showing the ability to succeed with 8–10 agents. The performance of the competing MADRL algorithms has been assessed using a number of metrics: the reward, which quantifies how well a task has been solved (the higher the better); the distance, which indicates the amount of navigation carried out by the agents to solve the task (the lower the better); the number of collisions, which shows the ability to avoid collisions (the lower the better); the time required to solve the task (the lower the better); the success rate, defined as the number of times an algorithm has solved a task over the total number of attempts; and caught targets, which refers to the number of landmarks that the pack managed to reach.
Illustrative videos showing CDC in action on the above environments can be found online¹.

4.2 Implementation details and experimental setup

For our experiments, we use neural networks with two hidden layers (64 units each) to implement the graph-generation modules (Eqs. 1, 2) and the action selector in Eq. 7. The RNN described in Eq. 10 is implemented as a long short-term memory (LSTM) network [93] with 64 units for the hidden state. We use the Adam optimizer [94] with a learning rate of 10⁻³ for the critic and 10⁻⁴ for the policies. Similarly to [76, 91], we set θ₁ = θ₂ = ... = θ_N in order to make the model invariant to the number of agents. The reward discount factor is set to 0.95, the size of the replay buffer to 10⁶, and the batch size to 1,024. At each iteration, we calculate the heat kernel over a finite grid of P = 300 time points, with the threshold for obtaining stable values set to s = 0.05; this value has been determined experimentally (see Table 4). The number of time steps per episode, T, is set to 50 for all the environments, except for Navigation Control where it is set to 25. For Formation Control, Line Control and Pack Control, the number of episodes E is set to 50,000 for the basic versions (30,000 for the scalable versions), while for Navigation Control it is set to 100,000 (30,000 for the scalable versions). All network parameters are updated every time 100 new samples are added to the replay buffer. Soft updates of the target networks use τ = 0.01. For discrete actions, which would otherwise truncate the gradient's flow, we adopt the low-variance Gumbel-Softmax gradient estimator so that back-propagation works properly with categorical variables.
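The Gumbel-Softmax trick mentioned above can be sketched as follows; this is the generic relaxation, not the authors' exact implementation, and the logits and temperature used in the example are arbitrary:

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    # Draw Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1),
    # then take a temperature-scaled softmax of the perturbed logits.
    # Low temperatures make the output almost one-hot while keeping it
    # differentiable with respect to the logits.
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    m = max(v / temperature for v in noisy)
    exps = [math.exp(v / temperature - m) for v in noisy]
    s = sum(exps)
    return [e / s for e in exps]
```

The output is a valid probability vector for any temperature, which is why the estimator lets gradients flow through what would otherwise be a hard categorical sample.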
All the presented results are produced by running every experiment 5 times with different seeds (1, 2001, 4001, 6001, 8001), so that a particular choice of seed cannot significantly condition the final performance. Python 3.6.6 [95] with PyTorch 0.4.1 [96] is used as the framework for machine learning and automatic differentiation. NetworkX 2.2 [97] has been used for graph analysis. Computations were mainly performed using an Intel(R) Xeon(R) CPU E5-2650 v3 at 2.30GHz and a GeForce GTX TITAN X GPU. With this configuration, the proposed CDC took on average approximately 8.3 hours to complete a training procedure on environments with four agents.

¹ https://youtu.be/H9kMtrnvRCQ

Navigation Control N = 3 (Reward | # collisions | Distance)
  DDPG       −57.3 ± 9.94   | 1.24 ± 0.39 | 4.09 ± 6.92
  MADDPG     −45.23 ± 6.59  | 0.77 ± 0.24 | 3.16 ± 5.74
  CommNet    −48.95 ± 6.25  | 0.92 ± 0.24 | 3.49 ± 5.09
  MAAC       −43.18 ± 6.44  | 0.71 ± 0.24 | 1.46 ± 2.97
  ST-MARL    −55.36 ± 8.17  | 1.54 ± 3.56 | 1.2 ± 0.33
  When2Com   −40.7 ± 5.33   | 0.61 ± 0.21 | 1.06 ± 3.26
  TarMAC     −44.9 ± 6.22   | 0.77 ± 0.24 | 2.14 ± 4.36
  IS         −42.6 ± 6.70   | 0.70 ± 0.29 | 1.22 ± 3.56
  CDC        −39.16 ± 4.77  | 0.56 ± 0.19 | 0.4 ± 1.66

Navigation Control N = 10 (Reward | # collisions | Distance)
  DDPG       −115.93 ± 21.26 | 8.83 ± 6.41   | 3.6 ± 0.85
  MADDPG     −112.17 ± 13.23 | 12.29 ± 7.45  | 3.44 ± 0.53
  CommNet    −104.49 ± 10.45 | 12.21 ± 6.87  | 3.14 ± 0.41
  MAAC       −107.38 ± 11.81 | 9.04 ± 6.46   | 3.26 ± 0.46
  ST-MARL    −110.69 ± 15.75 | 32.73 ± 32.77 | 3.27 ± 0.57
  When2Com   −112.51 ± 14.48 | 13.68 ± 11.29 | 3.45 ± 0.57
  TarMAC     −110.67 ± 13.76 | 9.81 ± 7.66   | 3.39 ± 0.54
  IS         −111.67 ± 9.18  | 12.28 ± 7.27  | 3.39 ± 0.68
  CDC        −102.68 ± 10.1  | 9.03 ± 9.36   | 3.06 ± 0.4

Formation Control N = 4 (Reward | Time | Success Rate)
  DDPG       −39.43 ± 12.37 | 50 ± 0.0      | 0 ± 0.0
  MADDPG     −19.86 ± 6.04  | 50 ± 0.0      | 0 ± 0.0
  CommNet    −7.77 ± 2.06   | 45.8 ± 10.19  | 0.18 ± 0.38
  MAAC       −5.77 ± 1.53   | 26.66 ± 17.2  | 0.66 ± 0.47
  ST-MARL    −20.24 ± 3.0   | 50 ± 0.0      | 0 ± 0.0
  When2Com   −17.00 ± 4.16  | 48.21 ± 10.11 | 0.12 ± 0.31
  TarMAC     −14.25 ± 2.58  | 47.35 ± 12.87 | 0.13 ± 0.45
  IS         −18.72 ± 3.43  | 49.79 ± 9.96  | 0.1 ± 0.41
  CDC        −4.22 ± 1.46   | 11.82 ± 5.49  | 0.99 ± 0.12

Formation Control N = 10 (Reward | Time | Success Rate)
  DDPG       −49.27 ± 6.11  | 50 ± 0.0     | 0 ± 0.0
  MADDPG     −20.65 ± 7.11  | 50 ± 0.0     | 0 ± 0.0
  CommNet    −10.22 ± 1.03  | 48.89 ± 5.5  | 0.04 ± 0.2
  MAAC       −9.63 ± 1.35   | 50 ± 0.0     | 0 ± 0.0
  ST-MARL    −19.81 ± 5.74  | 50 ± 0.0     | 0 ± 0.0
  When2Com   −18.49 ± 1.23  | 48.72 ± 0.9  | 0.01 ± 0.1
  TarMAC     −19.06 ± 1.23  | 49.44 ± 5.6  | 0.01 ± 0.1
  IS         −18.30 ± 4.36  | 50 ± 0.0     | 0 ± 0.0
  CDC        −7.51 ± 1.06   | 15.21 ± 9.23 | 0.99 ± 0.1

Line Control N = 4 (Reward | Time | Success Rate)
  DDPG       −33.45 ± 10.58 | 49.99 ± 0.22  | 0 ± 0.0
  MADDPG     −18.75 ± 2.32  | 47.32 ± 9.14  | 0.08 ± 0.27
  CommNet    −10.99 ± 2.24  | 46.97 ± 8.93  | 0.12 ± 0.33
  MAAC       −7.38 ± 2.09   | 17.08 ± 12.17 | 0.89 ± 0.32
  ST-MARL    −23.87 ± 7.77  | 50 ± 0.0      | 0 ± 0.0
  When2Com   −16.45 ± 3.01  | 46 ± 0.0      | 0.11 ± 0.3
  TarMAC     −17.75 ± 4.24  | 47.00 ± 0.0   | 0.09 ± 0.31
  IS         −16.11 ± 4.24  | 45.20 ± 0.0   | 0.10 ± 0.15
  CDC        −5.97 ± 1.73   | 10.42 ± 5.58  | 0.98 ± 0.13

Line Control N = 10 (Reward | Time | Success Rate)
  DDPG       −68.19 ± 10.2  | 50 ± 0.0      | 0 ± 0.0
  MADDPG     −12.69 ± 2.11  | 48.48 ± 7.12  | 0.04 ± 0.21
  CommNet    −9.58 ± 1.28   | 37.73 ± 14.85 | 0.47 ± 0.5
  MAAC       −8.58 ± 1.52   | 22.55 ± 16.09 | 0.76 ± 0.43
  ST-MARL    −19.24 ± 6.26  | 50 ± 0.0      | 0 ± 0.0
  When2Com   −10.1 ± 2.8    | 49.55 ± 4.24  | 0.01 ± 0.12
  TarMAC     −11.83 ± 1.63  | 49.91 ± 1.12  | 0.01 ± 0.09
  IS         −11.90 ± 1.52  | 49.84 ± 1.15  | 0.01 ± 0.03
  CDC        −7.96 ± 1.19   | 15.06 ± 12.02 | 0.91 ± 0.29

Dynamic Pack Control N = 4 (Reward | Distance | Targets caught)
  DDPG       −224.77 ± 87.65 | 3.52 ± 1.67 | 0 ± 0.0
  MADDPG     −116.15 ± 71.37 | 1.46 ± 0.72 | 0.2 ± 0.13
  CommNet     293.35 ± 446.89 | 1.11 ± 0.12 | 0.81 ± 0.89
  MAAC        −95.29 ± 61.65 | 1.25 ± 0.21 | 0.01 ± 0.12
  ST-MARL    −107.02 ± 71.84 | 1.26 ± 0.3  | 0.02 ± 0.14
  When2Com   −108.47 ± 73.58 | 1.32 ± 0.33 | 0.02 ± 0.14
  TarMAC       50.47 ± 73.58 | 1.20 ± 0.21 | 0.3 ± 0.55
  IS          235.74 ± 446.89 | 1.06 ± 0.35 | 0.80 ± 0.63
  CDC         369.5 ± 463.92 | 1.09 ± 0.1  | 0.96 ± 0.93

Dynamic Pack Control N = 8 (Reward | Distance | Targets caught)
  DDPG       −279.67 ± 70.18 | 4.58 ± 1.4  | 0 ± 0.0
  MADDPG     −110.86 ± 28.66 | 1.22 ± 0.28 | 0.0 ± 0.05
  CommNet     −76.18 ± 138.73 | 1.13 ± 0.25 | 0.07 ± 0.28
  MAAC       −105.15 ± 46.42 | 1.15 ± 0.28 | 0.01 ± 0.09
  ST-MARL    −123.91 ± 16.89 | 1.42 ± 0.36 | 0 ± 0.0
  When2Com   −111.47 ± 73.58 | 1.32 ± 0.33 | 0.02 ± 0.14
  TarMAC      −78.18 ± 42.5  | 1.18 ± 0.76 | 0.05 ± 0.21
  IS           50.19 ± 310.44 | 1.10 ± 0.29 | 0.34 ± 0.98
  CDC          58.03 ± 279.05 | 1.12 ± 0.14 | 0.35 ± 0.56

Table 1: Comparison of DDPG, MADDPG, CommNet, MAAC, ST-MARL, When2Com, TarMAC, IS and CDC on all environments. N is the number of agents. Results are averaged over five different seeds.

4.3 Main results

We have compared CDC against several baselines, each representing a different way to approach the multi-agent coordination problem: independent DDPG [35, 36], MADDPG [37], CommNet [21], MAAC [29], ST-MARL [76], When2Com [98], TarMAC [28] and Intention Sharing (IS²) [58]. Independent DDPG provides the simplest baseline, in that each agent works independently to solve the task.
In MADDPG each agent has its own critic with access to the combined observations and actions of all agents during learning. CommNet implements an explicit form of communication: the policies are implemented through a large neural network with some components shared across all the agents and others agent-specific. At every time-step, each agent's action depends on its local observation and on the average of all other policies' hidden states, which are used as messages. MAAC is a state-of-the-art method in which an attention mechanism guides the critics to select the information to be shared with the actors. ST-MARL uses a graph neural network to capture the spatio-temporal dependency of the observations and facilitate cooperation; unlike our approach, the graph edges here represent the time-dependent relationships amongst agents. When2Com utilises an attentional model to compute pairwise similarities between the agents' observation encodings, which results in a fully connected graph that is subsequently sparsified by a thresholding operation. Afterwards, each agent uses the remaining similarity scores to weight its neighbours' observations before producing its action.

² In our implementation, the number of steps to be predicted is set to one, i.e. each agent predicts the next step of every other agent. In the original paper, this is equivalent to IS(H=1). In addition, in order to maintain a fair comparison with the other baselines, a message at time t is used to generate the next actions, i.e. we do not rely on previously generated messages.
TarMAC is a framework where the agents broadcast their messages and then select whom to communicate with by aggregating the received communications through an attention mechanism. In IS [58] the agents generate their future intentions by simulating their trajectories, and an attention model then aggregates this information to share it with the others. Differently from the methods above, CDC utilises graph structures to support the formation of communication connectivities, and then uses the heat kernel, as an alternative form of attention mechanism, to allow each agent to aggregate the messages coming from the others.

  Method     | Type of communication | How information is aggregated         | Graph-based architecture | Communication delayed
  DDPG       | NA                    | NA                                    | No  | NA
  MADDPG     | Implicit              | Observation and action concatenation  | No  | Yes
  CommNet    | Explicit              | Sharing neural-network hidden states  | No  | No
  MAAC       | Implicit              | Attention                             | No  | Yes
  ST-MARL    | Implicit              | RNN + Attention                       | Yes | No
  When2Com   | Implicit              | Attention                             | Yes | No
  TarMAC     | Explicit              | Attention                             | No  | Yes
  IS         | Explicit              | Attention                             | No  | No
  CDC        | Explicit              | Heat kernel                           | Yes | No

Table 2: A comparative summary of various MARL algorithms according to how communication is implemented.

In Table 2 we provide a summary of selected features for each MADRL algorithm used in this work. First, we have indicated whether the communication is implicit or explicit. The former refers to the ability to share information without sending explicit messages, i.e. communication is inherited from a certain behaviour rather than being deliberately shared [99]; studies have shown that this approach is used by both animals and humans [100–102], and it has been discussed in a number of multi-agent reinforcement learning works [21, 22, 28, 57, 77, 103, 104].
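The "heat kernel" aggregation used by CDC (Eq. 6) forms each agent's message as $m_u^t = \sum_{v \in V} H_{u,v}^t c_{u,v}^t$, i.e. incoming content vectors weighted by the diffused-heat values. A minimal sketch, assuming one plain content vector per sender rather than the paper's learned pairwise encodings:

```python
def aggregate_messages(H, contents):
    # m_u = sum_v H[u][v] * c_v: agent u weights the incoming content
    # vectors by the diffused-heat values on its row of H.
    n = len(contents)
    d = len(contents[0])
    return [[sum(H[u][v] * contents[v][k] for v in range(n)) for k in range(d)]
            for u in range(n)]
```

With an identity H no information is exchanged (each agent keeps its own content), while denser rows of H blend content from more peers, which is exactly how the diffusion process modulates communication.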
Explicit communication assumes the existence of a specific mechanism deliberately introduced to share information within the system; this is considered to be the most common form of human communication [105, 106] and has also been widely explored in the context of reinforcement learning [20, 23, 58, 98]. This categorisation can help interpret the performance achieved in certain environments, such as Dynamic Pack Control, where explicit communication is more beneficial. We also report on how the information is aggregated amongst agents, whether the algorithm relies on a graph-based architecture, and whether the communication content is delayed, i.e. it is only utilised in the future and does not affect the current actions. For example, in TarMAC each message is broadcast and utilised by the agents in the next step, while in MAAC and MADDPG the communication happens through the critics and affects future actions only once the policy parameters are updated.

Table 1 summarises the experimental results obtained from all algorithms across all the environments. The metric values are obtained by executing the best model (chosen according to the best average reward returned during training) for an additional 100 episodes. We repeated each experiment using 5 different seeds, so each entry of Table 1 is an average over 500 values.

Fig. 4: Learning curves for the 9 competing algorithms on Navigation Control (N = 3, 10), Line Control (N = 4, 10), Formation Control (N = 4, 10) and Dynamic Pack Control (N = 4, 8). Horizontal axes report the number of episodes, vertical axes the achieved rewards.
Results are averaged over five different runs.

It can be noted that CDC outperforms all the competitors on all four environments on all the metrics. In Navigation Control (N = 3), the task is solved by minimising the overall distance travelled and the number of collisions, with an improvement over MAAC. In Formation Control (N = 4), the best performance is also achieved by CDC, which always succeeded in half the time taken by MAAC. When the number of agents is increased, and the level of difficulty is significantly higher, all the baselines fail to complete the task, whilst CDC still maintains excellent performance with a success rate of 0.99. In Line Control, both scenarios (N = 4 and N = 10) are efficiently solved by CDC with a higher success rate and in less time than MAAC, while all other algorithms fail. In Dynamic Pack Control, amongst the competitors, only CommNet does not fail. In this environment only the leaders can see the point of interest, hence the other agents must learn how to communicate with them. Here CDC also outperforms CommNet on both the number of targets caught and the travelled distance. Overall, the gains in performance achieved by CDC over the other methods increase significantly as the number of agents grows.

Learning curves for all the environments, averaged over five runs, are shown in Figure 4; CDC reaches the highest reward overall. The Dynamic Pack Control task is particularly interesting, as only three methods are capable of solving it, CommNet, IS and CDC, and all of them implement explicit communication mechanisms.
The high variance associated with CDC and CommNet in Dynamic Pack Control can be explained by the fact that, when a landmark is reached by all the agents, the environment returns a higher reward. These are the only two methods capable of solving the task, and the lower variance of the other methods reflects their poor performance. The performance of CDC when varying the number of agents at execution time is investigated in the Appendix (Section A).

4.4 Communication analysis

In this section we provide a qualitative evaluation of the communication patterns and associated topological structures that have emerged using CDC on the four environments. Figures 5 and 6 show the communication networks G_H^t evolving over time during an episode at execution time: black circles represent the landmarks, blue circles indicate the normal agents, and red circles the leaders. Their coordinates within the two-dimensional area indicate the navigation trajectories. The lines connecting pairs of agents represent the time-varying edge weights H^t; each element H_{u,v}^t quantifies the amount of heat diffused between the two nodes.

Fig. 5: Examples of communication networks G^t evolving over different episode time-steps on Navigation Control (N = 3, 10) and Line Control (N = 4, 10). Black circles represent landmarks; agents are represented in blue. Connections indicate the heat kernel connectivity weights generated by CDC.

Fig. 6: Examples of communication networks G^t evolving over different episode time-steps on Formation Control (N = 4, 10) and Dynamic Pack Control (N = 4, 8). Black circles represent landmarks; agents are represented in blue, leader agents in red. Connections indicate the heat kernel connectivity weights generated by CDC.

As expected, different patterns emerge in different environments (see Figures 5 and 6). For instance, in Formation Control the dynamic graphs are dense in the early stages of the episodes and become sparser later on, once the formation is found. The degree of topological adjustment observed over time indicates an initial burst of communication activity at the beginning of an episode; towards the end, the communication seems to have stabilised and consists of messages shared only across neighbours, which appears to be sufficient to maintain the polygonal shape. A different situation can be observed in Dynamic Pack Control (see Figure 6(d)). Here there is intense communication activity between leaders and members at an early stage, and the emerging topology approximates a bipartite graph between red and blue nodes. This is an expected and plausible pattern given the nature of this environment: the leaders need to share information with the members, which would otherwise not be able to locate the landmarks.

In addition to the above qualitative interpretation based on graph topologies, we can also quantify the emergence of different communication patterns by looking at changes in the statistics of the degree centrality (i.e. the number of connections of each agent) over time. Specifically, we compare the statistics attained at the beginning and end of an episode using the connectivity graph generated by CDC.
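The degree-centrality statistics used in this comparison can be obtained directly from a heat-kernel matrix by counting, per node, the off-diagonal weights above a cut-off; the cut-off value below is illustrative, not taken from the paper:

```python
from statistics import mean, pstdev

def degree_centrality_stats(H, threshold=0.1):
    # For each node u, count how many off-diagonal heat-kernel weights
    # exceed the threshold, then summarise the degrees across nodes.
    degrees = [sum(1 for v, w in enumerate(row) if v != u and w > threshold)
               for u, row in enumerate(H)]
    return mean(degrees), pstdev(degrees)
```

A low standard deviation indicates connections spread evenly across agents (as in Formation Control at the end of an episode), whereas a high one suggests clustering, as in Dynamic Pack Control.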
Table 3 shows the mean and variance of the degree centrality, across all nodes, for each environment. Changes in variance, for instance, may indicate the formation of clusters. It can be noted that in Navigation Control, Line Control and Formation Control the variance is significantly lower at the end of the episodes; this is expected, since the best strategy in such tasks consists of spreading the number of connections across all nodes. A different pattern emerges in Dynamic Pack Control, where the formation of clusters is necessary since the members need to connect with the leaders. These clusters are also visible in Figure 6(d).

Average Degree Centrality
  Environment                  | Beginning of episode | End of episode
  Navigation Control N = 10    | 1.7 ± 1.5            | 2.4 ± 0.5
  Line Control N = 10          | 2.5 ± 0.9            | 1.8 ± 0.4
  Formation Control N = 10     | 2.2 ± 1.7            | 2.1 ± 0.3
  Dynamic Pack Control N = 8   | 1.4 ± 0.91           | 1.6 ± 1.4

Table 3: Mean and standard deviation of the degree centrality calculated using the connectivity graphs generated by CDC. Metrics are calculated using the graphs produced in the first (beginning) and last (end) step of the episodes at execution time.

Fig. 7: Averaged communication graphs for all the environments: Navigation Control (N = 3, 10), Line Control (N = 4, 10), Formation Control (N = 4, 10) and Dynamic Pack Control (N = 4, 8). On the left side of each figure, the node sizes describe the eigenvector centrality, the connections represent the heat kernel values and the numbers indicate the node labels. On the right, the heat kernel values are shown as heatmaps, where axis numbers correspond to node labels.
Further appreciation of the role played by the heat kernel in driving the communication strategy can be gained from Figure 7, which provides visualisations for all the environments. On the left, the connection weights are visualised using a circular layout. Here the nodes represent agents, and the size of each node is proportional to the node's eigenvector centrality. The eigenvector centrality is a popular graph-spectral measure [107], used to determine the influence of a node by considering both its adjacent connections and the importance of its neighbouring nodes. This measure is calculated using the stable heat-diffusion values averaged over an episode, i.e. $\bar{H}_{u,v} = \left(\sum_{t=1}^{T} H_{u,v}^t\right)/T$. The resulting graph structure reflects the overall communication patterns that emerged while solving the given tasks. On the right, we visualise the square N × N matrix of averaged pairwise diffusion values as a heatmap (red values are higher). It can be noted that, in Pack Control, two communities of agents are formed, each one with a leader. Here, as expected, the leaders appear to be influential nodes (red nodes), and the heatmap shows that the connections between individual members and leaders are very strong. A different pattern emerges in Formation Control, where there is no evidence of communities, since all nodes are connected so as to nearly form a circular shape. The corresponding heatmap shows that the heat kernel values connecting neighbouring agents tend to be higher than those of more distant agents.

4.5 Ablation studies

We have carried out a number of studies to assess the relative importance of each new component contributing to CDC.
First, we investigate the relative merits of the heat kernel over two alternative and simpler information-propagation mechanisms: (a) a global average approach, where the observations of all other agents are averaged and provided to the agent to inform its action, and (b) a nearest neighbours approach, where only the observations of the agent's two nearest neighbours are averaged. For each of these two mechanisms, we compare a version using our proposed critic (Section 3.5), which has a recurrent architecture (specifically an LSTM), and a version using a traditional critic, i.e. one based on a feed-forward neural network. To better characterise the benefits of a recurrent network, we have also investigated an LSTM-based version of MADDPG. In addition, we have implemented a version of CDC that uses a softmax attention, i.e. the heat kernel connectivity weights are replaced by a softmax function. To ensure a fair comparison, only the necessary architectural changes have been made, so that the modelling capacity remains comparable across versions.

In Figure 8, it can be noted that the proposed CDC using the heat kernel achieves the highest performance by a significant margin. The other modified versions of CDC, with and without LSTM, also outperform the simpler communication methods. There is evidence to suggest that averaging local information coming from the nearest neighbours is a better strategy than using a global average; the latter cannot discard unnecessary information, which results in noisier embeddings and worse communication. Overall, we have observed that the LSTM-based critic is beneficial compared to the simpler alternative. This is an expected result because, by design, the LSTM's hidden state filters out irrelevant information from the sequence of inputs.
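The softmax-attention variant in this ablation replaces each row of heat-kernel weights with a softmax over pairwise scores. A minimal sketch of that substitution (the score values in the example are arbitrary):

```python
import math

def softmax_attention_weights(scores):
    # Ablation variant: each row of pairwise similarity scores is turned
    # into a probability distribution via a numerically stable softmax,
    # replacing the corresponding row of heat-kernel weights.
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out
```

Unlike the heat kernel, this per-row normalisation ignores the global graph topology, which is consistent with the lower performance the ablation reports for the softmax variant.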
Fig. 8: Learning curves of different versions of the proposed model on Formation Control (N = 4).

Another finding is that the order of the agents does not affect the final performance of the model. This is explained by the fact that each LSTM-based critic observes the entire sequence of observations and actions before producing its feedback. Furthermore, the softmax version of CDC has been found to be less performant than the original CDC, confirming the important role played by the heat kernel in aggregating the messages across the communication network.

In order to choose an appropriate threshold for the heat kernel equation (see Eq. 5), we have run a set of experiments in which we monitor how the success rate behaves for different parameter values. Table 4 reports the performance of CDC on Formation Control when the threshold parameter s varies over a grid of possible values; this threshold determines whether the heat kernel values are considered stable. The best performance is obtained with s = 0.05, which is the value used in all our experiments. To select the specific thresholds reported in Table 4, we tried a range of values suggested in related works [33, 108].

Formation Control N = 4 (Reward | Time | Success Rate)
  CDC s = 0.01    | −4.48 ± 1.62 | 13.52 ± 9.83 | 0.93 ± 0.21
  CDC s = 0.025   | −4.33 ± 1.28 | 14.01 ± 9.74 | 0.94 ± 0.24
  CDC s = 0.05    | −4.22 ± 1.46 | 11.82 ± 5.49 | 0.99 ± 0.1
  CDC s = 0.075   | −4.34 ± 1.43 | 12.88 ± 9.13 | 0.95 ± 0.22
  CDC s = 0.1     | −4.31 ± 1.57 | 12.52 ± 8.39 | 0.96 ± 0.2

Table 4: Comparison of CDC results using different values of the threshold s.
5 Conclusions

In this work we have presented a novel approach to deep multi-agent reinforcement learning that models agents as nodes of a state-dependent graph and uses the overall topology of the graph to facilitate communication and cooperation. The inter-agent communication patterns are represented by a connectivity graph that is used to decide which messages should be shared with others, how often, and with whom. A key novelty of this approach is that the graph topology is inferred directly from observations and is utilised as an attention mechanism guiding the agents throughout the sequential decision process. Unlike other recently proposed architectures that rely on graph convolutional networks to extract features, we make use of a graph diffusion process to simulate how the information propagates over the communication network and is aggregated. Our experimental results on four different environments have demonstrated that, compared to other state-of-the-art baselines, CDC can achieve superior performance on navigation tasks of increasing complexity, and remarkably so when the number of agents increases. We have also found that visualising the graphs learnt by the agents can shed some light on the role played by the diffusion process in mediating the communication strategy that ultimately yields highly rewarding policies. The current LSTM-based critic could potentially be replaced by a graph neural network equipped with an attention mechanism capable of tailoring individual feedback to the agents' needs.

This work represents an initial attempt to leverage well-known graph-theoretical properties in the context of a multi-agent communication strategy, and paves the way for future exploration along related directions.
For instance, further constraints could be imposed on the graph edges to regulate the overall communication process, e.g. using a notion of flow conservation [109]. Further investigations could be directed towards the effects of adopting a decentralised critic modelling the communication content together with the agents' state-action values to provide a richer individual feedback.

Supplementary information. Supplementary material is provided in the Appendix, as suggested by the provided template.

Statements and Declarations

Funding. GM acknowledges support from a UKRI AI Turing Acceleration Fellowship (EPSRC EP/V024868/1).

Conflict of interest/Competing interests. No competing or financial interests to disclose.

Ethics approval. Not applicable.

Consent to participate. The authors give their consent to participate.

Consent for publication. The authors give their consent for publication.

Availability of data and materials. Environments will be made available upon paper publication.

Availability of code. All code will be made available upon paper publication.

Authors' contributions. Authors' contributions follow the authors' order convention.

Appendix

Appendix A  Varying the number of agents

# agents    DDPG           CDC
3           2.34 ± 0.61    1.06 ± 0.12
4           3.52 ± 1.67    1.09 ± 0.10
5           3.90 ± 1.68    1.08 ± 0.15
6           4.44 ± 1.70    1.08 ± 0.18
7           5.21 ± 1.98    1.12 ± 0.12
8           6.49 ± 2.17    1.13 ± 0.11

Table A1: Comparison of DDPG and CDC on Dynamic Pack Control. Both algorithms were trained with 4 agents and tested with 3-8. The performance metric used here is the distance of the farthest agent to the landmark.

We tested whether CDC is capable of handling a different number of agents at test time.
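As a minimal sketch (not the authors' code), the permutation-invariant metric used in Table A1, the distance of the farthest agent to the landmark, could be computed as follows; the function and variable names are illustrative:

```python
import numpy as np

def farthest_agent_distance(positions, landmark):
    """Distance between the landmark and the agent farthest from it.

    positions: (n_agents, 2) array of agent coordinates;
    landmark: (2,) coordinate of the target landmark.
    Taking the maximum makes the metric invariant to the number of agents.
    """
    return float(np.linalg.norm(positions - landmark, axis=1).max())
```

Because only the worst-positioned agent contributes, policies trained with 4 agents can be evaluated with 3 to 8 agents on the same scale.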
Table A1 shows how the performance of DDPG and CDC compares when they are both trained using 4 learners, but 3-8 agents are used at test time. We report the distance between the farthest agent and the landmark, which is invariant to the number of agents. It can be noted that CDC can handle systems with a varying number of agents, outperforming DDPG and keeping the final performance competitive with other methods that have been trained with a larger number of agents (see Table 1).

References

[1] Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press, Cambridge (1998)

[2] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436-444 (2015)

[3] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85-117 (2015)

[4] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484 (2016)

[5] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529 (2015)

[6] Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 1-5 (2019)

[7] Tanner, H.G., Kumar, A.: Towards decentralization of multi-robot navigation functions. In: Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pp. 4132-4137 (2005).
IEEE

[8] Brunet, C.-A., Gonzalez-Rubio, R., Tetreault, M.: A multi-agent architecture for a driver model for autonomous road vehicles. In: Proceedings 1995 Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 772-775 (1995). IEEE

[9] Dresner, K., Stone, P.: Multiagent traffic management: A reservation-based intersection control mechanism. In: Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 530-537 (2004). IEEE Computer Society

[10] Lee, J.-H., Kim, C.-O.: Multi-agent systems applications in manufacturing systems and supply chain management: a review paper. International Journal of Production Research 46(1), 233-265 (2008)

[11] Hernandez-Leal, P., Kaisers, M., Baarslag, T., de Cote, E.M.: A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183 (2017)

[12] Rahaie, Z., Beigy, H.: Toward a solution to multi-agent credit assignment problem. In: 2009 International Conference of Soft Computing and Pattern Recognition, pp. 563-568 (2009). IEEE

[13] Harati, A., Ahmadabadi, M.N., Araabi, B.N.: Knowledge-based multi-agent credit assignment: A study on task type and critic information. IEEE Systems Journal 1(1), 55-67 (2007)

[14] Yliniemi, L., Tumer, K.: Multi-objective multiagent credit assignment through difference rewards in reinforcement learning. In: Asia-Pacific Conference on Simulated Evolution and Learning, pp. 407-418 (2014). Springer

[15] Agogino, A.K., Tumer, K.: Unifying temporal and structural credit assignment problems. In: AAMAS, vol. 4, pp. 980-987 (2004)

[16] Vorobeychik, Y., Joveski, Z., Yu, S.: Does communication help people coordinate?
PLoS ONE 12(2), 0170780 (2017)

[17] Demichelis, S., Weibull, J.W.: Language, meaning, and games: A model of communication, coordination, and evolution. American Economic Review 98(4), 1292-1311 (2008)

[18] Miller, J.H., Moser, S.: Communication and coordination. Complexity 9(5), 31-40 (2004)

[19] Kearns, M.: Experiments in social computation. Communications of the ACM 55(10), 56-67 (2012)

[20] Foerster, J., Assael, I.A., de Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2137-2145 (2016)

[21] Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems, pp. 2244-2252 (2016)

[22] Singh, A., Jain, T., Sukhbaatar, S.: Learning when to communicate at scale in multiagent cooperative and competitive tasks. ICLR (2019)

[23] Pesce, E., Montana, G.: Improving coordination in multi-agent deep reinforcement learning through memory-driven communication. Deep Reinforcement Learning Workshop, NeurIPS 2018, Montreal, Canada (2019)

[24] Jiang, J., Lu, Z.: Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733 (2018)

[25] Mao, H., Zhang, Z., Xiao, Z., Gong, Z.: Modelling the dynamic joint policy of teammates with attention multi-agent DDPG. arXiv preprint arXiv:1811.07029 (2018)

[26] Liu, Y., Wang, W., Hu, Y., Hao, J., Chen, X., Gao, Y.: Multi-agent game abstraction via graph attention neural network. In: AAAI, pp. 7211-7218 (2020)

[27] Hoshen, Y.: VAIN: Attentional multi-agent predictive modeling. In: Advances in Neural Information Processing Systems, pp.
2701-2711 (2017)

[28] Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., Pineau, J.: TarMAC: Targeted multi-agent communication. arXiv preprint arXiv:1810.11187 (2018)

[29] Iqbal, S., Sha, F.: Actor-attention-critic for multi-agent reinforcement learning. ICML (2019)

[30] Wang, T., Wang, J., Zheng, C., Zhang, C.: Learning nearly decomposable value functions via communication minimization. arXiv preprint arXiv:1910.05366 (2019)

[31] Zhang, F., Hancock, E.R.: Graph spectral image smoothing using the heat kernel. Pattern Recognition 41(11), 3328-3342 (2008)

[32] Chung, A.W., Pesce, E., Monti, R.P., Montana, G.: Classifying HCP task-fMRI networks using heat kernels. In: 2016 International Workshop on Pattern Recognition in NeuroImaging (PRNI), pp. 1-4 (2016). IEEE

[33] Chung, A.W., Schirmer, M., Krishnan, M.L., Ball, G., Aljabar, P., Edwards, A.D., Montana, G.: Characterising brain network topologies: a dynamic analysis approach using heat kernels. NeuroImage 141, 490-501 (2016)

[34] Degris, T., White, M., Sutton, R.S.: Off-policy actor-critic. arXiv preprint arXiv:1205.4839 (2012)

[35] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: ICML (2014)

[36] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2015)

[37] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O.P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6379-6390 (2017)

[38] Stone, P., Veloso, M.: Multiagent systems: A survey from a machine learning perspective.
Autonomous Robots 8(3), 345-383 (2000)

[39] Parsons, S., Wooldridge, M.: Game theory and decision theory in multi-agent systems. Autonomous Agents and Multi-Agent Systems 5(3), 243-254 (2002)

[40] Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press (2008)

[41] Nguyen, T.T., Nguyen, N.D., Nahavandi, S.: Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics (2020)

[42] Hernandez-Leal, P., Kartal, B., Taylor, M.E.: A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33(6), 750-797 (2019)

[43] Albrecht, S.V., Stone, P.: Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence 258, 66-95 (2018)

[44] Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38(2), 156-172 (2008)

[45] Tuyls, K., Weiss, G.: Multiagent learning: Basics, challenges, and prospects. AI Magazine 33(3), 41 (2012)

[46] Laurent, G.J., Matignon, L., Fort-Piat, L., et al.: The world of independent learners is not Markovian. International Journal of Knowledge-based and Intelligent Engineering Systems 15(1), 55-64 (2011)

[47] Kraemer, L., Banerjee, B.: Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, 82-94 (2016)

[48] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients. arXiv preprint (2017)

[49] Wang, R.E., Everett, M., How, J.P.: R-MADDPG for partially observable environments and limited communication.
arXiv preprint arXiv:2002.06684 (2020)

[50] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735-1780 (1997)

[51] Lin, K., Zhao, R., Xu, Z., Zhou, J.: Efficient large-scale fleet management via multi-agent deep reinforcement learning. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1774-1783 (2018)

[52] Scardovi, L., Sepulchre, R.: Synchronization in networks of identical linear systems. In: Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pp. 546-551 (2008). IEEE

[53] Wen, G., Duan, Z., Yu, W., Chen, G.: Consensus in multi-agent systems with communication constraints. International Journal of Robust and Nonlinear Control 22(2), 170-182 (2012)

[54] Wunder, M., Littman, M., Stone, M.: Communication, credibility and negotiation using a cognitive hierarchy model. In: Workshop #19: MSDM 2009, p. 73 (2009)

[55] Itō, T., Zhang, M., Robu, V., Fatima, S., Matsuo, T., Yamaki, H.: Innovations in Agent-Based Complex Automated Negotiations. Springer (2011)

[56] Fox, D., Burgard, W., Kruppa, H., Thrun, S.: A probabilistic approach to collaborative multi-robot localization. Autonomous Robots 8(3), 325-344 (2000)

[57] Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., Wang, J.: Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069 (2017)

[58] Kim, W., Park, J., Sung, Y.: Communication in multi-agent reinforcement learning: Intention sharing. In: International Conference on Learning Representations (2020)

[59] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998-6008 (2017)

[60] Chung, F.R., Graham, F.C.: Spectral Graph Theory. American Mathematical Soc.
(1997)

[61] Brouwer, A.E., Haemers, W.H.: Spectra of Graphs. Springer (2011)

[62] Cvetković, D.M., et al.: Spectra of Graphs: Theory and Application (1980)

[63] Schoen, R., Yau, S.-T.: Lectures on Differential Geometry. International Press (1994)

[64] Kloster, K., Gleich, D.F.: Heat kernel based community detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386-1395 (2014). ACM

[65] Lafferty, J., Lebanon, G.: Diffusion kernels on statistical manifolds. Journal of Machine Learning Research 6(Jan), 129-163 (2005)

[66] Xu, B., Shen, H., Cao, Q., Cen, K., Cheng, X.: Graph convolutional networks using heat kernel for semi-supervised learning. arXiv preprint arXiv:2007.16002 (2020)

[67] Klicpera, J., Weißenberger, S., Günnemann, S.: Diffusion improves graph learning. In: Advances in Neural Information Processing Systems, pp. 13354-13366 (2019)

[68] Kschischang, F.R., Frey, B.J., Loeliger, H.-A., et al.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47(2), 498-519 (2001)

[69] Kuyer, L., Whiteson, S., Bakker, B., Vlassis, N.: Multiagent reinforcement learning for urban traffic control using coordination graphs. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 656-671 (2008). Springer

[70] Guestrin, C., Koller, D., Parr, R.: Multiagent planning with factored MDPs. In: Advances in Neural Information Processing Systems, pp. 1523-1530 (2002)

[71] Liao, W., Bak-Jensen, B., Pillai, J.R., Wang, Y., Wang, Y.: A review of graph neural networks and their applications in power systems.
arXiv preprint arXiv:2101.10025 (2021)

[72] Zhou, H., Ren, D., Xia, H., Fan, M., Yang, X., Huang, H.: AST-GNN: An attention-based spatio-temporal graph neural network for interaction-aware pedestrian trajectory prediction. Neurocomputing 445, 298-308 (2021)

[73] Huang, Y., Bi, H., Li, Z., Mao, T., Wang, Z.: STGAT: Modeling spatial-temporal interactions for human trajectory prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6272-6281 (2019)

[74] Mohamed, A., Qian, K., Elhoseiny, M., Claudel, C.: Social-STGCNN: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14424-14432 (2020)

[75] Xu, Z., Zhang, B., Bai, Y., Li, D., Fan, G.: Learning to coordinate via multiple graph neural networks. arXiv preprint arXiv:2104.03503 (2021)

[76] Wang, Y., Xu, T., Niu, X., Tan, C., Chen, E., Xiong, H.: STMARL: A spatio-temporal multi-agent reinforcement learning approach for traffic light control. arXiv preprint arXiv:1908.10577 (2019)

[77] Li, S., Gupta, J.K., Morales, P., Allen, R., Kochenderfer, M.J.: Deep implicit coordination graphs for multi-agent reinforcement learning. arXiv preprint arXiv:2006.11438 (2020)

[78] Jiang, J., Dun, C., Huang, T., Lu, Z.: Graph convolutional reinforcement learning. arXiv preprint arXiv:1810.09202 (2018)

[79] Chen, H., Liu, Y., Zhou, Z., Hu, D., Zhang, M.: GAMA: Graph attention multi-agent reinforcement learning algorithm for cooperation. Applied Intelligence 50(12), 4195-4205 (2020)

[80] Seraj, E., Wang, Z., Paleja, R., Sklar, M., Patel, A., Gombolay, M.: Heterogeneous graph attention networks for learning diverse communication.
arXiv preprint arXiv:2108.09568 (2021)

[81] Su, J., Adams, S., Beling, P.A.: Counterfactual multi-agent reinforcement learning with graph convolution communication. arXiv preprint arXiv:2004.00470 (2020)

[82] Yuan, Q., Fu, X., Li, Z., Luo, G., Li, J., Yang, F.: GraphComm: Efficient graph convolutional communication for multi-agent cooperation. IEEE Internet of Things Journal (2021)

[83] Niu, Y., Paleja, R., Gombolay, M.: Multi-agent graph-attention communication and teaming. In: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 964-973 (2021)

[84] Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Machine Learning Proceedings 1994, pp. 157-163. Elsevier (1994)

[85] Kondor, R., Lafferty, J.: Diffusion kernels on graphs and other discrete input spaces. In: Proceedings of ICML 2002, pp. 315-322 (2002)

[86] Fiedler, M.: Laplacian of graphs and algebraic connectivity. Banach Center Publications 25(1), 57-70 (1989)

[87] Al-Mohy, A.H., Higham, N.J.: A new scaling and squaring algorithm for the matrix exponential. SIAM Journal on Matrix Analysis and Applications 31(3), 970-989 (2009)

[88] Cheng, A.H.-D., Cheng, D.T.: Heritage and early history of the boundary element method. Engineering Analysis with Boundary Elements 29(3), 268-302 (2005)

[89] Mesbahi, M., Egerstedt, M.: Graph Theoretic Methods in Multiagent Networks. Princeton University Press (2010)

[90] Balch, T., Arkin, R.C.: Behavior-based formation control for multirobot teams. IEEE Transactions on Robotics and Automation 14(6), 926-939 (1998)

[91] Agarwal, A., Kumar, S., Sycara, K.: Learning transferable cooperative behavior in multi-agent teams.
arXiv preprint arXiv:1906.01202 (2019)

[92] Mordatch, I., Abbeel, P.: Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908 (2017)

[93] Schmidhuber, J.: A general method for multi-agent reinforcement learning in unrestricted environments. In: Adaptation, Coevolution and Learning in Multiagent Systems: Papers from the 1996 AAAI Spring Symposium, pp. 84-87 (1996)

[94] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

[95] Van Rossum, G., Drake Jr, F.L.: Python tutorial. Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands (1995)

[96] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)

[97] Hagberg, A., Swart, P., Schult, D.: Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States) (2008)

[98] Liu, Y.-C., Tian, J., Glaser, N., Kira, Z.: When2com: Multi-agent perception via communication graph grouping. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4106-4115 (2020)

[99] Breazeal, C., Kidd, C.D., Thomaz, A.L., Hoffman, G., Berlin, M.: Effects of nonverbal communication on efficiency and robustness in human-robot teamwork. In: 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 708-713 (2005). IEEE

[100] Mech, L.D., Boitani, L.: Wolves: Behavior, Ecology, and Conservation. University of Chicago Press (2007)

[101] Quick, N.J., Janik, V.M.: Bottlenose dolphins exchange signature whistles when meeting at sea.
Proceedings of the Royal Society B: Biological Sciences 279(1738), 2539-2545 (2012)

[102] Schaller, G.B.: The Serengeti Lion: A Study of Predator-Prey Relations. University of Chicago Press (2009)

[103] Montesello, F., D'Angelo, A., Ferrari, C., Pagello, E.: Implicit coordination in a multi-agent system using a behavior-based approach. In: Distributed Autonomous Robotic Systems 3, pp. 351-360. Springer (1998)

[104] Grupen, N.A., Lee, D.D., Selman, B.: Multi-agent curricula and emergent implicit signaling. In: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp. 553-561 (2022)

[105] Gildert, N., Millard, A.G., Pomfret, A., Timmis, J.: The need for combining implicit and explicit communication in cooperative robotic systems. Frontiers in Robotics and AI 5, 65 (2018)

[106] Håkansson, G., Westander, J.: Communication in Humans and Other Animals. John Benjamins Publishing Company, Amsterdam (2013)

[107] Bonacich, P.: Some unique properties of eigenvector centrality. Social Networks 29(4), 555-564 (2007)

[108] Xiao, B., Wilson, R.C., Hancock, E.R.: Characterising graphs using the heat kernel (2005)

[109] Jia, J., Schaub, M.T., Segarra, S., Benson, A.R.: Graph-based semi-supervised & active learning for edge flows. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 761-771 (2019)
