Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors.

Authors: Enguang Fan, Yifan Chen, Zihan Shan

Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

Enguang Fan†*, Yifan Chen†*, Zihan Shan†, Matthew Caesar†, Jae Kim§
†University of Illinois at Urbana-Champaign, §Boeing Research and Technology
Emails: {enguang2, yifanc3, zshan2, caesar}@illinois.edu, {jae.h.kim}@boeing.com

Abstract—Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent–entity attention module, and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g., 74% coverage with M=5 UAVs and N=10 nodes) while remaining competitive with a mixed-integer linear programming (MILP) optimization-based (offline) upper bound, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non-communicating baselines.

Index Terms—Unmanned Aerial Vehicles, UAV networks, Multi-Agent Reinforcement Learning, Decentralized Control, Graph Neural Networks

I. INTRODUCTION

Unmanned Aerial Vehicles (UAVs), commonly known as drones, are increasingly deployed as mobile sensing and communication platforms. A prominent application is to use autonomous UAVs as rapidly deployable aerial relays when terrestrial infrastructure is damaged by natural disasters or overloaded during crowded events [1], [2]. In scenarios such as wildfire monitoring and battlefield surveillance, UAVs may also operate beyond an operator's control radius, requiring on-board autonomy and peer-to-peer coordination. In such scenarios, UAV teams must decide where to position themselves to maximize sensing or communication coverage over dynamic areas of interest.

These coverage and placement decisions naturally give rise to optimization formulations. Many multi-UAV coverage and deployment tasks can be viewed through the lens of the Maximum Coverage Location Problem (MCLP), which is NP-hard and becomes computationally intractable in large or dynamic environments [3]. Moreover, real deployments are characterized by partial observability (each UAV can only sense nearby entities) and communication constraints (only nearby UAVs can exchange messages), making purely centralized controllers fragile and difficult to scale.

To address these challenges, we develop a scalable multi-UAV deployment control system by integrating Multi-Agent Reinforcement Learning (MARL) with a graph-based environment representation. We adopt centralized training with decentralized execution (CTDE): during training, a centralized critic can access global information to stabilize learning, while during execution each UAV runs a shared policy using only local observations and peer-to-peer messages from communication neighbors. Concretely, we model the environment as an agent–entity graph and use attention-based embeddings to represent the state of each agent.

* These authors contributed equally to this work.
We evaluate our approach primarily on a cooperative relay deployment task (DroneConnect) under full/partial observability and unrestricted/restricted communication, and we include a secondary mixed cooperative–competitive task (DroneCombat) to demonstrate applicability beyond fully cooperative settings.

To summarize, our contributions are as follows.
• We propose a multi-agent reinforcement learning (MARL) framework for multi-UAV deployment. The proposed framework is trained with centralized training and decentralized execution (CTDE), under partial observability and distance-limited communication constraints.
• We introduce a dual-attention graph encoder which encodes: (i) agent–entity attention for local environment embedding and (ii) neighbor self-attention for inter-agent message aggregation.
• We demonstrate high-coverage decentralized relay deployment in a fully cooperative task (DroneConnect) and show that the same framework transfers to an adversarial engagement task (DroneCombat) without architectural changes.

The rest of this paper is structured as follows. Section II reviews prior research in learning and wireless communication for multi-UAV systems. Section III introduces our environment embedding, message sharing, and CTDE learning design. Section IV presents the simulation scenarios. We report evaluation results in Section V and conclude in Section VI.

TABLE I: Key notation used throughout the paper.

Symbol | Description
M / N | number of UAVs / number of nodes (entities)
p_i(t) / u_j(t) | position of UAV i / node j at time t
r_s / r_c / r_cov | sensing radius / communication radius / coverage radius
N_s(i) | entities sensed by UAV i (within r_s)
N_c(i) | UAV neighbors of i that can communicate (within r_c)
h_i / m_i | latent embedding of UAV i / aggregated message
FO/PO | full / partial observability
UC/RC | unrestricted / restricted communication

II. RELATED WORK

This section reviews existing research related to our work. We first summarize reinforcement-learning approaches for UAV control, then discuss communication mechanisms and graph-based representations for multi-agent coordination.

Maximum Coverage Location Problem for multiple UAVs: Early research on UAV deployment has commonly framed the problem as a variant of the Maximum Coverage Location Problem (MCLP). When applied to UAV systems, MCLP captures the core challenge of selecting drone locations that maximize sensing or communication coverage under resource constraints. A representative formulation is the Maximum Coverage Facility Location Problem with Drones (MCFLPD) proposed by Chauhan et al. [4], which models UAV deployment as a static mixed-integer program incorporating battery-limited range, energy consumption, and facility capacities; despite its expressiveness, the MCFLPD quickly becomes computationally expensive and requires specialized heuristics for a tractable solution.

Reinforcement Learning for Multi-UAV Systems: Researchers have demonstrated the utility of reinforcement learning algorithms in UAV-assisted wireless networks [5], [6]. Lee et al. [7] introduced the DroneDR framework for UAV deployment within a centralized architecture. Still, this approach poses a single point of failure and lacks practicality in satellite-denied environments. Kaviani et al. [8] proposed DeepCQ+, a deep reinforcement-learning-based routing protocol for highly dynamic mobile ad hoc networks. Our work addresses similar scenarios but adopts a distributed approach. UAVs are also utilized for wildfire monitoring in [9], where Julian et al. employed deep Q-learning for path planning. At the same time, the field of adversarial multi-UAV environments remains relatively unexplored.
Multi-Agent Reinforcement Learning: MARL encompasses three interaction categories: fully cooperative, fully competitive, and mixed. Our objective is to develop algorithms suitable for all agent interactions. Moreover, MARL approaches can be categorized into centralized, decentralized, and hybrid methods. Centralized approaches use a single agent to formulate multi-agent systems, which is challenging to scale as the number of agents increases. Decentralized approaches, such as Tampuu et al.'s Q-learning [10], employ independent Q-value functions for each agent but struggle in non-stationary environments. Centralized learning with decentralized execution is another approach, represented by algorithms such as COMA [11], BiCNet [12], and MADDPG [13], in which a centralized critic is accessible only during training. However, such approaches often assume a fixed number of agents, limiting their applicability in dynamic environments with varying agent counts.

Communication Mechanisms Between Agents: Many MARL approaches neglect explicit inter-agent communication. Differentiable communication protocols, as seen in CommNet [14] and VAIN [15], have improved communication using attention mechanisms. We apply scaled dot-product attention for inter-agent communication. Real-world scenarios often impose communication limits based on proximity, as seen in TarMAC [16]. DGN [17] allows agents to communicate with their nearest neighbors, aligning with practical drone swarm operations. We introduce distance-based communication restrictions, which can split the agent graph into multiple connected components and thereby demand greater coordination within the team.

Graph Neural Networks: Graphs naturally model multi-agent systems, with nodes representing agents. GNNs, such as message-passing neural networks [18] and the Graph Attention Network (GAT) [19], employ trainable weights for feature propagation among nodes. Agarwal et al.
[20] introduced entity graphs for environment integration, focusing on fully cooperative settings. OpenAI explored multi-agent reinforcement learning for the emergence of complex behavior [21]. In our work, we adopt an agent–entity graph to aggregate environment information across diverse settings, including cooperative, competitive, and mixed environments.

III. ENVIRONMENT MODELING

To model large multi-agent environments efficiently, we represent the swarm and its surroundings as an agent–entity graph G = (V, E), where vertices correspond to UAV agents and observable environment entities (e.g., ground nodes), and edges encode sensing and communication relationships. Each UAV i forms (i) a sensed-entity set N_s(i) from entities within its sensing radius and (ii) a communication-neighbor set N_c(i) from UAVs within its communication radius. This separation allows us to model partial observability (via N_s) and restricted communication (via N_c) in a unified way.

A. Message Passing Over the Communication Graph

We use message passing to aggregate information among UAVs over the communication graph. Let h_i^{(k)} denote UAV i's latent embedding after k rounds of communication message passing, with h_i^{(0)} initialized from its local observation embedding (Section III-B). A generic message-passing round can be written as

m_i^{(k)} = f_agg^{(k)}( h_i^{(k-1)}, { h_j^{(k-1)} : j ∈ N_c(i) } ),  1 ≤ k ≤ K,   (1)
h_i^{(k)} = f_upd( h_i^{(k-1)}, m_i^{(k)} ),   (2)

where m_i^{(k)} is the aggregated message and K is the number of message-passing rounds. Here, f_agg^{(k)}(·) denotes a permutation-invariant aggregation operator over neighbor embeddings (instantiated as attention-weighted aggregation in Section III-C), and f_upd(·) is a learnable update function (e.g., an MLP or GRU) that fuses the previous embedding with the aggregated message.
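For concreteness, the generic round in Eqs. (1)–(2) can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: mean pooling stands in for the learned f_agg (the paper uses attention, Section III-C), and a fixed linear map plus tanh stands in for the learned f_upd.

```python
import numpy as np

def rc_neighbors(positions, r_c):
    """Restricted-communication neighbor sets N_c(i): UAVs within radius r_c."""
    M = len(positions)
    return [[j for j in range(M) if j != i
             and np.linalg.norm(positions[i] - positions[j]) <= r_c]
            for i in range(M)]

def message_passing_round(h, neighbors, W_self, W_msg):
    """One generic round (Eqs. 1-2). h: (M, d) embeddings.
    Toy stand-ins: mean pooling for f_agg, tanh(W_self h + W_msg m) for f_upd."""
    h_new = np.empty_like(h)
    for i in range(len(h)):
        nbrs = neighbors[i]
        if nbrs:  # Eq. (1): permutation-invariant aggregation over N_c(i)
            m_i = np.mean([h[j] for j in nbrs], axis=0)
        else:     # isolated UAV: no incoming messages
            m_i = np.zeros_like(h[i])
        h_new[i] = np.tanh(W_self @ h[i] + W_msg @ m_i)  # Eq. (2): f_upd
    return h_new
```

Note that an isolated UAV (empty N_c(i)) still updates from its own embedding, which matches the RC setting where the communication graph may have several connected components.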
In unrestricted communication (UC), N_c(i) contains all other UAVs; in restricted communication (RC), N_c(i) = { j ≠ i : ||p_i − p_j||_2 ≤ r_c }.

B. Environment Embedding

Each UAV i maintains a local state S_i (e.g., position and velocity) and observes a variable-size set of entities within its sensing range. We encode the UAV state and entity features via

h_i^a = f_a(S_i),   (3)
e_{i,l} = f_e(x_{i,l}),  l ∈ N_s(i),   (4)

where x_{i,l} denotes the feature vector of entity l as observed by UAV i. To obtain a fixed-size environment summary that is invariant to the number of sensed entities, we apply scaled dot-product attention with the UAV embedding as the query and entity embeddings as keys/values:

q_i = W_q h_i^a,   (5)
k_{i,l} = W_k e_{i,l},  l ∈ N_s(i),   (6)
v_{i,l} = W_v e_{i,l},  l ∈ N_s(i),   (7)
α_{i,l} = exp( q_i^T k_{i,l} / √d_k ) / Σ_{l' ∈ N_s(i)} exp( q_i^T k_{i,l'} / √d_k ),   (8)
E_i^agg = Σ_{l ∈ N_s(i)} α_{i,l} v_{i,l},   (9)
h_i^{(0)} = [ h_i^a ; E_i^agg ],   (10)

where d_k is the key dimension and [·;·] denotes concatenation. The resulting h_i^{(0)} is used as the per-agent input to the policy and as the initialization for communication message passing.

C. Inter-Agent Message Sharing

UAV i aggregates messages from its communication neighbors N_c(i) using self-attention. Let Ñ_c(i) = N_c(i) ∪ {i} denote the neighbor set with a self-loop. For round k, we compute

q_i^c = W_q^c h_i^{(k-1)},   (11)
k_j^c = W_k^c h_j^{(k-1)},  j ∈ Ñ_c(i),   (12)
v_j^c = W_v^c h_j^{(k-1)},  j ∈ Ñ_c(i),   (13)
β_{i,j} = exp( (q_i^c)^T k_j^c / √d_k ) / Σ_{j' ∈ Ñ_c(i)} exp( (q_i^c)^T k_{j'}^c / √d_k ),   (14)
m_i^{(k)} = Σ_{j ∈ Ñ_c(i)} β_{i,j} v_j^c,   (15)
h_i^{(k)} = f_upd( h_i^{(k-1)}, m_i^{(k)} ).   (16)

During decentralized execution, each UAV performs this aggregation using only messages received from N_c(i), matching the RC setting.
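The attention-weighted aggregation of Eqs. (11)–(15) can be sketched as follows. This is an illustrative NumPy rendering for a single UAV and round; the weight matrices here are plain arrays standing in for the learned parameters W_q^c, W_k^c, W_v^c.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def neighbor_attention(h_prev, i, comm_nbrs, Wq, Wk, Wv):
    """Scaled dot-product message aggregation (Eqs. 11-15) for UAV i.
    h_prev: (M, d) embeddings from the previous round; comm_nbrs: N_c(i)."""
    idx = comm_nbrs + [i]                        # self-loop: Ñ_c(i) = N_c(i) ∪ {i}
    q = Wq @ h_prev[i]                           # Eq. (11): query
    K = np.stack([Wk @ h_prev[j] for j in idx])  # Eq. (12): keys
    V = np.stack([Wv @ h_prev[j] for j in idx])  # Eq. (13): values
    d_k = q.shape[0]
    beta = softmax(K @ q / np.sqrt(d_k))         # Eq. (14): attention weights
    return beta @ V                              # Eq. (15): aggregated message m_i
```

Because the softmax weights sum to one, the aggregated message is a convex combination of the neighbor value vectors, which keeps the output scale stable regardless of how many neighbors are in range.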
D. Centralized Training with Decentralized Execution (CTDE)

We train the swarm under CTDE. During execution, each UAV i samples actions from a decentralized actor

a_i(t) ∼ π_θ( · | o_i(t), h_i^{(K)}(t) ),   (17)

where o_i(t) is the local observation and h_i^{(K)}(t) is the final embedding after K communication rounds. During training, we additionally use a centralized critic that has access to global information, e.g.,

V_φ(s(t)),  s(t) = { S_i(t) }_{i=1}^M ∪ { u_j(t) }_{j=1}^N.   (18)

The critic is used only for learning; at test time, UAVs execute π_θ without access to s(t) or any centralized coordinator.

IV. SCENARIOS AND TASKS

We study multi-UAV deployment under partial observability and distance-limited communication. Our primary focus is a cooperative relay deployment task (DroneConnect), and we include a secondary mixed cooperative–competitive task (DroneCombat) to demonstrate that the same CTDE graph-based framework applies beyond purely cooperative settings. We also compare against a static optimization-based formulation as a reference upper bound.

A. Optimization-Based Static View

A common abstraction of coverage and relay placement is to maximize the amount of demand covered within a service radius while penalizing relocation costs, where facilities represent drones and demand represents ground nodes. Let p_i^0 be the current position of UAV i, and let p_i be its placement decision. A simplified maximum-coverage objective can be written as

max_{ {p_i}_{i=1}^M }  Σ_{j=1}^N w_j z_j − α Σ_{i=1}^M || p_i − p_i^0 ||_2   (19)
s.t.  z_j = I[ min_{i ∈ {1,...,M}} || u_j − p_i ||_2 ≤ r_cov ],  j = 1, ..., N,

where w_j is a node priority weight and r_cov is the service/coverage radius.
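The objective in Eq. (19) is straightforward to evaluate for any candidate placement. The sketch below evaluates it directly and then applies a simple greedy placement over an assumed discretized candidate grid; the greedy step is our own illustrative stand-in and is not the MILP the paper solves for its reference bound.

```python
import numpy as np

def coverage_objective(P, P0, U, w, r_cov, alpha):
    """Objective of Eq. (19): weighted covered demand minus relocation cost.
    P: placement decisions (M, 2); P0: current positions (M, 2);
    U: node positions (N, 2); w: node priority weights (N,)."""
    d = np.linalg.norm(U[:, None, :] - P[None, :, :], axis=-1)  # (N, M) distances
    z = (d.min(axis=1) <= r_cov)                                # z_j indicators
    move = np.linalg.norm(P - P0, axis=1).sum()                 # relocation penalty
    return float(w @ z) - alpha * move

def greedy_placement(P0, U, w, r_cov, alpha, grid):
    """Illustrative greedy heuristic: place UAVs one at a time on the candidate
    grid, each time maximizing the objective with the remaining UAVs held at
    their current positions. Not the paper's MILP solver."""
    P = P0.copy()
    for i in range(len(P0)):
        P[i] = max(grid, key=lambda g: coverage_objective(
            np.vstack([P[:i], [g], P0[i + 1:]]), P0, U, w, r_cov, alpha))
    return P
```

A greedy pass like this gives the usual (1 − 1/e)-style approximation flavor for coverage objectives, which is why it is a common cheap baseline next to exact MILP solutions.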
This problem is NP-hard; to provide an optimization-based reference, we discretize candidate UAV locations and solve a mixed-integer linear programming (MILP) formulation of (19) offline, which serves as an approximate upper bound on the attainable coverage for static snapshots. We highlight that the MILP requires intensive computation and is therefore not suitable for online control.

B. DroneConnect Scenario

DroneConnect models a team of UAV relays repositioning to provide coverage to mobile ground nodes (Fig. 1).

1) Action Space: For UAV i, the continuous action is a 2D force (or acceleration command) a_i = (F_x, F_y) that updates its velocity and position.

2) Observability and Communication Settings: We evaluate four settings that combine observation and communication constraints:
• FO (full observability): each UAV observes all ground-node states (and UAV states).
• PO (partial observability): each UAV observes only entities within sensing radius r_s (i.e., N_s(i)).
• UC (unrestricted communication): all UAVs can exchange messages (complete communication graph).
• RC (restricted communication): UAVs communicate only if within radius r_c (i.e., N_c(i)).

3) Reward: The DroneConnect task is fully cooperative: all UAVs share the same team reward during centralized training and execute decentralized policies at test time. Let d_j(t) = min_i ||u_j(t) − p_i(t)||_2 be the distance from node j to its nearest UAV, and let c_j(t) = I[d_j(t) ≤ r_cov] indicate whether node j is covered. We use the normalized reward

r(t) = λ_cov · (1/N) Σ_{j=1}^N c_j(t) − λ_dist · (1/N) Σ_{j=1}^N d_j(t)/r_cov,   (20)

where λ_cov and λ_dist trade off coverage quantity and service quality.

C. DroneCombat Scenario

To demonstrate generality beyond fully cooperative settings, we also consider a lightweight adversarial engagement environment, DroneCombat, as a secondary task.
Drones are split into two teams, and each team aims to eliminate all opponents by firing directional laser beams. We emphasize that DroneCombat is used here only as an auxiliary scenario; our primary evaluation is DroneConnect.

Fig. 1. DroneConnect environment with 2 UAV relays and 4 mobile nodes.

1) Action Space: Each drone uses a 3D continuous action a_i = (F_x, F_y, F_rot) that controls planar motion and rotation (firing direction).

2) Reward: We use a sparse-and-dense shaped reward (Table II) that encourages successful hits, discourages wasted firing, and rewards winning quickly.

TABLE II: Rewards for the secondary DroneCombat task (T is episode length).

Event | Reward
Drone i emits a laser beam | −0.1
Drone i emits a laser but misses any target | −1
Drone i's laser hits an opponent | +3
Drone i gets hit by a laser beam | −3
Each timestep | −50/T
All opponent drones are eliminated | +20

V. EVALUATION

We evaluate the proposed CTDE graph-based MARL framework on the two scenarios described in Section IV. Our primary focus is the cooperative DroneConnect task, while DroneCombat is reported as a secondary demonstration of generality.

A. Experimental Setup

We report representative settings here to support reproducibility. In DroneConnect, UAVs and nodes move in a bounded 2D area of size 100 × 100 with timestep Δt = 0.1 s and episode length T = 200 steps. Each UAV senses entities within radius r_s and communicates within radius r_c (RC); in UC we allow all-to-all messaging. We train using a CTDE actor–critic (PPO-style) with a centralized critic and decentralized actors for 2 × 10^6 environment steps, and we evaluate over 50 episodes. Unless otherwise stated, we average results over 3 random seeds. For communication, each UAV transmits a d = 64-dimensional message embedding to each neighbor per timestep; thus the per-step communication cost scales with the average degree of the RC graph.
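The degree-scaling of the communication cost can be made concrete with a small back-of-the-envelope sketch. The accounting below is our own illustration (d = 64 matches the setup above; one float per embedding dimension per directed in-range link is an assumption, not a figure from the paper).

```python
import numpy as np

def per_step_message_floats(positions, r_c, d=64):
    """Rough per-timestep communication load under RC: each UAV sends a
    d-dimensional embedding to every neighbor within r_c, so the total number
    of transmitted floats is d times the number of directed RC edges."""
    M = len(positions)
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    directed_edges = int((dist <= r_c).sum() - M)  # exclude the M self-pairs
    return directed_edges * d
```

Dividing this by M gives d times the average degree, which is the scaling stated above.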
B. DroneConnect Results

We use M to represent the number of UAVs and N to represent the number of ground nodes to cover. The evaluation metric is the average coverage ratio over an episode, defined as the number of covered nodes divided by the total number of nodes.

Fig. 2. Average coverage per timestep as the number of UAVs (M) and nodes (N) vary (FO+UC setting).

TABLE III: DroneConnect coverage results (mean ± std over 3 seeds). UC/RC: unrestricted/restricted communication; FO/PO: full/partial observability.

Method | M | N | Comm | Obs | Coverage Ratio
Ours (CTDE, dual-attention) | 3 | 6 | UC | FO | 0.76 ± 0.01
Ours (CTDE, dual-attention) | 3 | 6 | RC | FO | 0.74 ± 0.02
Ours (CTDE, dual-attention) | 3 | 6 | UC | PO | 0.72 ± 0.02
Ours (CTDE, dual-attention) | 3 | 6 | RC | PO | 0.71 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | UC | FO | 0.79 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | RC | FO | 0.77 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | UC | PO | 0.76 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | RC | PO | 0.74 ± 0.02
Ablation: no inter-communication | 5 | 10 | RC | PO | 0.65 ± 0.03
Ablation: no entity attention | 5 | 10 | RC | PO | 0.63 ± 0.03
Static MILP upper bound | 3 | 6 | – | – | 0.77
Static MILP upper bound | 5 | 10 | – | – | 0.80
Centralized single-agent RL | 3 | 6 | – | FO | 0.75 ± 0.02
Centralized single-agent RL | 5 | 10 | – | FO | 0.79 ± 0.02

1) Coverage Results and Ablations: Table III summarizes coverage under different observability and communication constraints. Overall, our CTDE approach maintains strong coverage under partial observation and restricted communication, while remaining competitive with the static MILP upper bound. We also report two minimal ablations: disabling inter-UAV communication and replacing entity attention with mean pooling; both degrade performance in the challenging RC+PO (restricted communication and partial observability) setting.
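The team reward of Eq. (20) and the average coverage ratio used as the evaluation metric can both be written as short vectorized computations. The sketch below is illustrative; the default λ values are placeholders, not the trained settings.

```python
import numpy as np

def droneconnect_reward(P, U, r_cov, lam_cov=1.0, lam_dist=0.1):
    """Team reward of Eq. (20): coverage fraction minus normalized mean
    distance. P: UAV positions (M, 2); U: node positions (N, 2).
    lam_cov/lam_dist defaults are illustrative placeholders."""
    d = np.linalg.norm(U[:, None, :] - P[None, :, :], axis=-1)  # (N, M)
    d_j = d.min(axis=1)           # distance of each node to its nearest UAV
    c_j = (d_j <= r_cov)          # coverage indicators c_j(t)
    return lam_cov * c_j.mean() - lam_dist * (d_j / r_cov).mean()

def average_coverage_ratio(P_traj, U_traj, r_cov):
    """Evaluation metric: covered nodes / total nodes, averaged over the
    episode. P_traj: (T, M, 2) UAV trajectories; U_traj: (T, N, 2) nodes."""
    ratios = [
        (np.linalg.norm(U[:, None, :] - P[None, :, :], axis=-1)
           .min(axis=1) <= r_cov).mean()
        for P, U in zip(P_traj, U_traj)
    ]
    return float(np.mean(ratios))
```

Note the distance term in the reward also penalizes nodes that are already covered, which pushes UAVs toward the centers of the node clusters they serve rather than their edges.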
2) Zero-Shot Policy Generalization: We evaluate zero-shot generalization by applying a policy trained with M = 5 UAVs directly to scenarios with different team sizes (RC+PO), without fine-tuning. Table IV reports representative coverage ratios. Overall, performance remains stable across team sizes, suggesting that the graph-based representation helps the policy generalize across varying numbers of agents.

TABLE IV: Coverage results with zero-shot transfer in DroneConnect.

Num drones | Num nodes | Comm | Obs | Coverage Ratio
M − 2 | 6 | RC | PO | 0.70
M − 1 | 6 | RC | PO | 0.72
M = 5 | 10 | RC | PO | 0.74
M + 1 | 10 | RC | PO | 0.80
M + 2 | 10 | RC | PO | 0.82

3) Qualitative Coordination and Overlap: In addition to coverage ratio, we quantify coordination by measuring the coverage overlap rate, defined as the fraction of covered nodes that are simultaneously within r_cov of more than one UAV. In the challenging RC+PO setting with M = 5, N = 10, our learned policy achieves a low overlap rate of 0.12, compared to 0.19 for the no-communication ablation, indicating better division of coverage responsibilities. Figure 3 illustrates representative trajectories and communication links under different team sizes.

Fig. 3. Representative DroneConnect snapshots showing division of coverage tasks: (a) M = 3, N = 6; (b) M = 3, N = 10; (c) M = 4, N = 8; (d) M = 5, N = 10. Red links indicate in-range (RC) inter-UAV communication.

C. DroneCombat Results

We briefly report results on the DroneCombat task (Section IV-C) to demonstrate that the same CTDE graph-based architecture can be applied to a mixed cooperative–competitive environment, and we leave a detailed study to future work. We evaluate a 5-vs-5 setting and compare against simple baselines.

TABLE V: DroneCombat results (mean ± std over 3 seeds).

Method | 5v5 win rate | 5v5 avg. episode steps
Ours (CTDE, dual-attention) | 0.62 ± 0.05 | 140 ± 18
No-communication ablation | 0.49 ± 0.04 | 168 ± 22
Independent agents (no CTDE) | 0.42 ± 0.03 | 175 ± 25

Fig. 4. Sample progression of a 5-vs-5 DroneCombat episode (attackers in pink, defenders in cyan). Eliminated drones are shown in darker colors.

VI. CONCLUSION

We presented a centralized training with decentralized execution (CTDE) multi-agent reinforcement learning framework for cooperative UAV deployment under partial observability and communication constraints. Our method represents the environment as an agent–entity graph and uses dual attention: agent–entity attention for local environment embedding and neighbor self-attention for inter-UAV message aggregation. During execution, each UAV runs a decentralized policy using only local observations and peer-to-peer messages, without any centralized coordinator.

In the cooperative DroneConnect task, our approach achieves high coverage under restricted communication and partial observability while remaining competitive with the static MILP upper bound. We also showed that the learned policy can generalize zero-shot to different team sizes in DroneConnect. Finally, we included a mixed cooperative–competitive DroneCombat scenario to illustrate that the same architecture can be applied beyond purely cooperative settings. We leave detailed wireless QoS models and explicit analysis of communication-cost constraints for future work.

REFERENCES

[1] S. Yin, Z. Qu, and L. Li, "Uplink resource allocation in cellular networks with energy-constrained UAV relay," in 2018 IEEE 87th Vehicular Technology Conference (VTC Spring), pp. 1–5, IEEE, 2018.
[2] E. Fan, A. Peng, M. Caesar, J. Kim, J. Eckhardt, G. Kimberly, and D. Osipychev, "Towards effective swarm-based GPS spoofing detection in disadvantaged platforms," in MILCOM 2023 - 2023 IEEE Military Communications Conference (MILCOM), pp. 722–728, 2023.
[3] M. Sobouti, R. Mahapatra, and M. A. Rahman, "Utilizing UAVs in wireless networks: Advantages, challenges, objectives, and solution methods," Vehicles, vol. 6, no. 3, pp. 764–789, 2024.
[4] D. Chauhan, A. Unnikrishnan, and M. Figliozzi, "Maximum coverage capacitated facility location problem with range constrained drones," Transportation Research Part C: Emerging Technologies, vol. 99, pp. 1–18, 2019.
[5] J. Hu, H. Zhang, and L. Song, "Reinforcement learning for decentralized trajectory design in cellular UAV networks with sense-and-send protocol," IEEE Internet of Things Journal, vol. 6, no. 4, pp. 6177–6189, 2019.
[6] C. Wang, J. Wang, Y. Shen, and X. Zhang, "Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach," IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 2124–2136, 2019.
[7] I. Lee, V. Babu, M. Caesar, and D. Nicol, "Deep reinforcement learning for UAV-assisted emergency response," in MobiQuitous 2020 - 17th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, MobiQuitous '20, (New York, NY, USA), pp. 327–336, Association for Computing Machinery, 2021.
[8] S. Kaviani, B. Ryu, E. Ahmed, K. Larson, A. Le, A. Yahja, and J. H. Kim, "DeepCQ+: Robust and scalable routing with multi-agent deep reinforcement learning for highly dynamic networks," in MILCOM 2021 - 2021 IEEE Military Communications Conference (MILCOM), pp. 31–36, 2021.
[9] K. D. Julian and M. J. Kochenderfer, "Distributed wildfire surveillance with autonomous aircraft using deep reinforcement learning," Journal of Guidance, Control, and Dynamics, vol. 42, no. 8, pp. 1768–1778, 2019.
[10] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, "Multiagent cooperation and competition with deep reinforcement learning," PLOS ONE, vol. 12, pp. 1–15, 2017.
[11] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI'18, AAAI Press, 2018.
[12] P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, and J. Wang, "Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games," arXiv preprint arXiv:1703.10069, 2017.
[13] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," Advances in Neural Information Processing Systems, vol. 30, 2017.
[14] S. Sukhbaatar, R. Fergus, et al., "Learning multiagent communication with backpropagation," Advances in Neural Information Processing Systems, vol. 29, 2016.
[15] Y. Hoshen, "VAIN: Attentional multi-agent predictive modeling," Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau, "TarMAC: Targeted multi-agent communication," in International Conference on Machine Learning, pp. 1538–1546, PMLR, 2019.
[17] J. Jiang, C. Dun, T. Huang, and Z. Lu, "Graph convolutional reinforcement learning," arXiv preprint, 2018.
[18] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning, ICML '17, pp. 1263–1272, JMLR.org, 2017.
[19] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[20] A. Agarwal, S. Kumar, K. Sycara, and M. Lewis, "Learning transferable cooperative behavior in multi-agent teams," in Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '20, pp. 1741–1743, International Foundation for Autonomous Agents and Multiagent Systems, 2020.
[21] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch, "Emergent tool use from multi-agent autocurricula," in International Conference on Learning Representations, 2019.
