Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors.

Authors: Enguang Fan, Yifan Chen, Zihan Shan

Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

Enguang Fan†*, Yifan Chen†*, Zihan Shan†, Matthew Caesar†, Jae Kim§
†University of Illinois at Urbana-Champaign, §Boeing Research and Technology
Emails: {enguang2, yifanc3, zshan2, caesar}@illinois.edu, {jae.h.kim}@boeing.com

Abstract—Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent–entity attention module, and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g., 74% coverage with M=5 UAVs and N=10 nodes) while remaining competitive with a mixed-integer linear programming (MILP) optimization-based (offline) upper bound, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non-communicating baselines.

Index Terms—Unmanned Aerial Vehicles, UAV networks, Multi-Agent Reinforcement Learning, Decentralized Control, Graph Neural Networks

I. INTRODUCTION

Unmanned Aerial Vehicles (UAVs), commonly known as drones, are increasingly deployed as mobile sensing and communication platforms. A prominent application is to use autonomous UAVs as rapidly deployable aerial relays when terrestrial infrastructure is damaged by natural disasters or overloaded during crowded events [1], [2]. In scenarios such as wildfire monitoring and battlefield surveillance, UAVs may also operate beyond an operator's control radius, requiring on-board autonomy and peer-to-peer coordination. In such scenarios, UAV teams must decide where to position themselves to maximize sensing or communication coverage over dynamic areas of interest.

These coverage and placement decisions naturally give rise to optimization formulations. Many multi-UAV coverage and deployment tasks can be viewed through the lens of the Maximum Coverage Location Problem (MCLP), which is NP-hard and becomes computationally intractable in large or dynamic environments [3]. Moreover, real deployments are characterized by partial observability (each UAV can only sense nearby entities) and communication constraints (only nearby UAVs can exchange messages), making purely centralized controllers fragile and difficult to scale.

To address these challenges, we develop a scalable multi-UAV deployment control system by integrating Multi-Agent Reinforcement Learning (MARL) with a graph-based environment representation. We adopt centralized training with decentralized execution (CTDE): during training, a centralized critic can access global information to stabilize learning, while during execution each UAV runs a shared policy using only local observations and peer-to-peer messages from communication neighbors. Concretely, we model the environment as an agent–entity graph and use attention-based embeddings to represent the state of each agent.

* These authors contributed equally to this work.
We evaluate our approach primarily on a cooperative relay deployment task (DroneConnect) under full/partial observability and unrestricted/restricted communication, and we include a secondary mixed cooperative–competitive task (DroneCombat) to demonstrate applicability beyond fully cooperative settings.

To summarize, our contributions are as follows.
• We propose a multi-agent reinforcement learning (MARL) framework for multi-UAV deployment. The proposed framework is trained with centralized training and decentralized execution (CTDE), under partial observability and distance-limited communication constraints.
• We introduce a dual-attention graph encoder which encodes: (i) agent–entity attention for local environment embedding and (ii) neighbor self-attention for inter-agent message aggregation.
• We demonstrate high-coverage decentralized relay deployment in a fully cooperative task (DroneConnect) and show that the same framework transfers to an adversarial engagement task (DroneCombat) without architectural changes.

The rest of this paper is structured as follows. Section II reviews prior research in learning and wireless communication for multi-UAV systems. Section III introduces our environment embedding, message sharing, and CTDE learning design. Section IV presents the simulation scenarios. We report evaluation results in Section V and conclude in Section VI.

TABLE I: Key notation used throughout the paper.

Symbol | Description
M / N | number of UAVs / number of nodes (entities)
p_i(t) / u_j(t) | position of UAV i / node j at time t
r_s / r_c / r_cov | sensing radius / communication radius / coverage radius
N_s(i) | entities sensed by UAV i (within r_s)
N_c(i) | UAV neighbors of i that can communicate (within r_c)
h_i / m_i | latent embedding of UAV i / aggregated message
FO/PO | full / partial observability
UC/RC | unrestricted / restricted communication

II. RELATED WORK

This section reviews existing research related to our work. We first summarize reinforcement-learning approaches for UAV control, then discuss communication mechanisms and graph-based representations for multi-agent coordination.

Maximum Coverage Location Problem for multiple UAVs: Early research on UAV deployment has commonly framed the problem as a variant of the Maximum Coverage Location Problem (MCLP). When applied to UAV systems, MCLP captures the core challenge of selecting drone locations that maximize sensing or communication coverage under resource constraints. A representative formulation is the Maximum Coverage Facility Location Problem with Drones (MCFLPD) proposed by Chauhan et al. [4], which models UAV deployment as a static mixed-integer program incorporating battery-limited range, energy consumption, and facility capacities; despite its expressiveness, the MCFLPD quickly becomes computationally expensive and requires specialized heuristics for a tractable solution.

Reinforcement Learning for Multi-UAV Systems: Researchers have demonstrated the utility of reinforcement learning algorithms in UAV-assisted wireless networks [5], [6]. Lee et al. [7] introduced the DroneDR framework for UAV deployment within a centralized architecture. Still, this approach poses a single point of failure and lacks practicality in satellite-denied environments. Kaviani et al. [8] proposed DeepCQ+, a deep reinforcement-learning-based routing protocol for highly dynamic mobile ad hoc networks. Our work addresses similar scenarios but adopts a distributed approach. UAVs are also utilized for wildfire monitoring in [9], where Julian et al. employed deep Q-learning for path planning. At the same time, the field of adversarial multi-UAV environments remains relatively unexplored.
Multi-Agent Reinforcement Learning: MARL encompasses three interaction categories: fully cooperative, fully competitive, and mixed. Our objective is to develop algorithms suitable for all agent interactions. Moreover, MARL approaches can be categorized into centralized, decentralized, and hybrid methods. Centralized approaches use a single agent to formulate multi-agent systems, which is challenging to scale as the number of agents increases. Decentralized approaches, such as Tampuu et al.'s Q-learning [10], employ independent Q-value functions for each agent but struggle in non-stationary environments. Centralized learning with decentralized execution is another approach, represented by algorithms such as COMA [11], BiCNet [12], and MADDPG [13], in which a centralized critic is accessible only during training. However, such approaches often assume a fixed number of agents, limiting their applicability in dynamic environments with varying agent counts.

Communication Mechanisms Between Agents: Many MARL approaches neglect explicit inter-agent communication. Differentiable communication protocols, as seen in CommNet [14] and VAIN [15], have improved communication using attention mechanisms. We apply scaled dot-product attention for inter-agent communication. Real-world scenarios often impose communication limits based on proximity, as seen in TarMAC [16]. DGN [17] allows agents to communicate with their nearest neighbors, aligning with practical drone swarm operations. We introduce distance-based communication restrictions, which can split the agent graph into multiple connected components and thereby demand greater coordination within the team.

Graph Neural Networks: Graphs naturally model multi-agent systems, with nodes representing agents. GNNs, such as message-passing neural networks [18] and the Graph Attention Network (GAT) [19], employ trainable weights for feature propagation among nodes. Agarwal et al.
[20] introduced entity graphs for environment integration, focusing on fully cooperative settings. OpenAI explored multi-agent reinforcement learning for the emergence of complex behavior [21]. In our work, we adopt an agent–entity graph to aggregate environment information across diverse settings, including cooperative, competitive, and mixed environments.

III. ENVIRONMENT MODELING

To model large multi-agent environments efficiently, we represent the swarm and its surroundings as an agent–entity graph G = (V, E), where vertices correspond to UAV agents and observable environment entities (e.g., ground nodes), and edges encode sensing and communication relationships. Each UAV i forms (i) a sensed-entity set N_s(i) from entities within its sensing radius and (ii) a communication-neighbor set N_c(i) from UAVs within its communication radius. This separation allows us to model partial observability (via N_s) and restricted communication (via N_c) in a unified way.

A. Message Passing Over the Communication Graph

We use message passing to aggregate information among UAVs over the communication graph. Let h_i^{(k)} denote UAV i's latent embedding after k rounds of communication message passing, with h_i^{(0)} initialized from its local observation embedding (Section III-B). A generic message-passing round can be written as

m_i^{(k)} = f_agg^{(k)}( h_i^{(k-1)}, { h_j^{(k-1)} : j ∈ N_c(i) } ),  1 ≤ k ≤ K,   (1)
h_i^{(k)} = f_upd( h_i^{(k-1)}, m_i^{(k)} ),   (2)

where m_i^{(k)} is the aggregated message and K is the number of message-passing rounds. Here, f_agg^{(k)}(·) denotes a permutation-invariant aggregation operator over neighbor embeddings (instantiated as attention-weighted aggregation in Section III-C), and f_upd(·) is a learnable update function (e.g., an MLP or GRU) that fuses the previous embedding with the aggregated message.
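For concreteness, the generic round in Eqs. (1)–(2) can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation: mean pooling stands in for the learned f_agg (the paper uses attention, Section III-C), and a fixed linear map plus tanh stands in for the learned f_upd.

```python
import numpy as np

def rc_neighbors(positions, r_c):
    """Restricted-communication neighbor sets N_c(i): UAVs within radius r_c."""
    M = len(positions)
    return [[j for j in range(M) if j != i
             and np.linalg.norm(positions[i] - positions[j]) <= r_c]
            for i in range(M)]

def message_passing_round(h, neighbors, W_self, W_msg):
    """One generic round (Eqs. 1-2). h: (M, d) embeddings.
    Toy stand-ins: mean pooling for f_agg, tanh(W_self h + W_msg m) for f_upd."""
    h_new = np.empty_like(h)
    for i in range(len(h)):
        nbrs = neighbors[i]
        if nbrs:  # Eq. (1): permutation-invariant aggregation over N_c(i)
            m_i = np.mean([h[j] for j in nbrs], axis=0)
        else:     # isolated UAV: no incoming messages
            m_i = np.zeros_like(h[i])
        h_new[i] = np.tanh(W_self @ h[i] + W_msg @ m_i)  # Eq. (2): f_upd
    return h_new
```

Note that an isolated UAV (empty N_c(i)) still updates from its own embedding, which matches the RC setting where the communication graph may have several connected components.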
In unrestricted communication (UC), N_c(i) contains all other UAVs; in restricted communication (RC), N_c(i) = { j ≠ i : ||p_i − p_j||_2 ≤ r_c }.

B. Environment Embedding

Each UAV i maintains a local state S_i (e.g., position and velocity) and observes a variable-size set of entities within its sensing range. We encode the UAV state and entity features via

h_i^a = f_a(S_i),   (3)
e_{i,l} = f_e(x_{i,l}),  l ∈ N_s(i),   (4)

where x_{i,l} denotes the feature vector of entity l as observed by UAV i. To obtain a fixed-size environment summary that is invariant to the number of sensed entities, we apply scaled dot-product attention with the UAV embedding as the query and entity embeddings as keys/values:

q_i = W_q h_i^a,   (5)
k_{i,l} = W_k e_{i,l},  l ∈ N_s(i),   (6)
v_{i,l} = W_v e_{i,l},  l ∈ N_s(i),   (7)
α_{i,l} = exp( q_i^T k_{i,l} / √d_k ) / Σ_{l' ∈ N_s(i)} exp( q_i^T k_{i,l'} / √d_k ),   (8)
E_i^agg = Σ_{l ∈ N_s(i)} α_{i,l} v_{i,l},   (9)
h_i^{(0)} = [ h_i^a ; E_i^agg ],   (10)

where d_k is the key dimension and [·;·] denotes concatenation. The resulting h_i^{(0)} is used as the per-agent input to the policy and as the initialization for communication message passing.

C. Inter-Agent Message Sharing

UAV i aggregates messages from its communication neighbors N_c(i) using self-attention. Let Ñ_c(i) = N_c(i) ∪ {i} denote the neighbor set with a self-loop. For round k, we compute

q_i^c = W_q^c h_i^{(k-1)},   (11)
k_j^c = W_k^c h_j^{(k-1)},  j ∈ Ñ_c(i),   (12)
v_j^c = W_v^c h_j^{(k-1)},  j ∈ Ñ_c(i),   (13)
β_{i,j} = exp( (q_i^c)^T k_j^c / √d_k ) / Σ_{j' ∈ Ñ_c(i)} exp( (q_i^c)^T k_{j'}^c / √d_k ),   (14)
m_i^{(k)} = Σ_{j ∈ Ñ_c(i)} β_{i,j} v_j^c,   (15)
h_i^{(k)} = f_upd( h_i^{(k-1)}, m_i^{(k)} ).   (16)

During decentralized execution, each UAV performs this aggregation using only messages received from N_c(i), matching the RC setting.
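The attention-weighted aggregation of Eqs. (11)–(15) can be sketched as follows. This is an illustrative NumPy rendering for a single UAV and round; the weight matrices here are plain arrays standing in for the learned parameters W_q^c, W_k^c, W_v^c.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def neighbor_attention(h_prev, i, comm_nbrs, Wq, Wk, Wv):
    """Scaled dot-product message aggregation (Eqs. 11-15) for UAV i.
    h_prev: (M, d) embeddings from the previous round; comm_nbrs: N_c(i)."""
    idx = comm_nbrs + [i]                        # self-loop: Ñ_c(i) = N_c(i) ∪ {i}
    q = Wq @ h_prev[i]                           # Eq. (11): query
    K = np.stack([Wk @ h_prev[j] for j in idx])  # Eq. (12): keys
    V = np.stack([Wv @ h_prev[j] for j in idx])  # Eq. (13): values
    d_k = q.shape[0]
    beta = softmax(K @ q / np.sqrt(d_k))         # Eq. (14): attention weights
    return beta @ V                              # Eq. (15): aggregated message m_i
```

Because the softmax weights sum to one, the aggregated message is a convex combination of the neighbor value vectors, which keeps the output scale stable regardless of how many neighbors are in range.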
D. Centralized Training with Decentralized Execution (CTDE)

We train the swarm under CTDE. During execution, each UAV i samples actions from a decentralized actor

a_i(t) ∼ π_θ( · | o_i(t), h_i^{(K)}(t) ),   (17)

where o_i(t) is the local observation and h_i^{(K)}(t) is the final embedding after K communication rounds. During training, we additionally use a centralized critic that has access to global information, e.g.,

V_φ(s(t)),  s(t) = { S_i(t) }_{i=1}^M ∪ { u_j(t) }_{j=1}^N.   (18)

The critic is used only for learning; at test time, UAVs execute π_θ without access to s(t) or any centralized coordinator.

IV. SCENARIOS AND TASKS

We study multi-UAV deployment under partial observability and distance-limited communication. Our primary focus is a cooperative relay deployment task (DroneConnect), and we include a secondary mixed cooperative–competitive task (DroneCombat) to demonstrate that the same CTDE graph-based framework applies beyond purely cooperative settings. We also compare against a static optimization-based formulation as a reference upper bound.

A. Optimization-Based Static View

A common abstraction of coverage and relay placement is to maximize the amount of demand covered within a service radius while penalizing relocation costs, where facilities represent drones and demand represents ground nodes. Let p_i^0 be the current position of UAV i, and let p_i be its placement decision. A simplified maximum-coverage objective can be written as

max_{ {p_i}_{i=1}^M }  Σ_{j=1}^N w_j z_j − α Σ_{i=1}^M || p_i − p_i^0 ||_2   (19)
s.t.  z_j = I[ min_{i ∈ {1,...,M}} || u_j − p_i ||_2 ≤ r_cov ],  j = 1, ..., N,

where w_j is a node priority weight and r_cov is the service/coverage radius.
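The objective in Eq. (19) is straightforward to evaluate for any candidate placement. The sketch below evaluates it directly and then applies a simple greedy placement over an assumed discretized candidate grid; the greedy step is our own illustrative stand-in and is not the MILP the paper solves for its reference bound.

```python
import numpy as np

def coverage_objective(P, P0, U, w, r_cov, alpha):
    """Objective of Eq. (19): weighted covered demand minus relocation cost.
    P: placement decisions (M, 2); P0: current positions (M, 2);
    U: node positions (N, 2); w: node priority weights (N,)."""
    d = np.linalg.norm(U[:, None, :] - P[None, :, :], axis=-1)  # (N, M) distances
    z = (d.min(axis=1) <= r_cov)                                # z_j indicators
    move = np.linalg.norm(P - P0, axis=1).sum()                 # relocation penalty
    return float(w @ z) - alpha * move

def greedy_placement(P0, U, w, r_cov, alpha, grid):
    """Illustrative greedy heuristic: place UAVs one at a time on the candidate
    grid, each time maximizing the objective with the remaining UAVs held at
    their current positions. Not the paper's MILP solver."""
    P = P0.copy()
    for i in range(len(P0)):
        P[i] = max(grid, key=lambda g: coverage_objective(
            np.vstack([P[:i], [g], P0[i + 1:]]), P0, U, w, r_cov, alpha))
    return P
```

A greedy pass like this gives the usual (1 − 1/e)-style approximation flavor for coverage objectives, which is why it is a common cheap baseline next to exact MILP solutions.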
This problem is NP-hard; to provide an optimization-based reference, we discretize candidate UAV locations and solve a mixed-integer linear programming (MILP) formulation of (19) offline, which serves as an approximate upper bound on the attainable coverage for static snapshots. We highlight that the MILP requires intensive computation and is therefore not suitable for online control.

B. DroneConnect Scenario

DroneConnect models a team of UAV relays repositioning to provide coverage to mobile ground nodes (Fig. 1).

1) Action Space: For UAV i, the continuous action is a 2D force (or acceleration command) a_i = (F_x, F_y) that updates its velocity and position.

2) Observability and Communication Settings: We evaluate four settings that combine observation and communication constraints:
• FO (full observability): each UAV observes all ground-node states (and UAV states).
• PO (partial observability): each UAV observes only entities within sensing radius r_s (i.e., N_s(i)).
• UC (unrestricted communication): all UAVs can exchange messages (complete communication graph).
• RC (restricted communication): UAVs communicate only if within radius r_c (i.e., N_c(i)).

3) Reward: The DroneConnect task is fully cooperative: all UAVs share the same team reward during centralized training and execute decentralized policies at test time. Let d_j(t) = min_i ||u_j(t) − p_i(t)||_2 be the distance from node j to its nearest UAV, and let c_j(t) = I[d_j(t) ≤ r_cov] indicate whether node j is covered. We use the normalized reward

r(t) = λ_cov · (1/N) Σ_{j=1}^N c_j(t) − λ_dist · (1/N) Σ_{j=1}^N d_j(t)/r_cov,   (20)

where λ_cov and λ_dist trade off coverage quantity and service quality.

C. DroneCombat Scenario

To demonstrate generality beyond fully cooperative settings, we also consider a lightweight adversarial engagement environment, DroneCombat, as a secondary task.
Drones are split into two teams, and each team aims to eliminate all opponents by firing directional laser beams. We emphasize that DroneCombat is used here only as an auxiliary scenario; our primary evaluation is DroneConnect.

Fig. 1. DroneConnect environment with 2 UAV relays and 4 mobile nodes.

1) Action Space: Each drone uses a 3D continuous action a_i = (F_x, F_y, F_rot) that controls planar motion and rotation (firing direction).

2) Reward: We use a sparse-and-dense shaped reward (Table II) that encourages successful hits, discourages wasted firing, and rewards winning quickly.

TABLE II: Rewards for the secondary DroneCombat task (T is episode length).

Event | Reward
Drone i emits a laser beam | −0.1
Drone i emits a laser but misses any target | −1
Drone i's laser hits an opponent | +3
Drone i gets hit by a laser beam | −3
Each timestep | −50/T
All opponent drones are eliminated | +20

V. EVALUATION

We evaluate the proposed CTDE graph-based MARL framework on the two scenarios described in Section IV. Our primary focus is the cooperative DroneConnect task, while DroneCombat is reported as a secondary demonstration of generality.

A. Experimental Setup

We report representative settings here to support reproducibility. In DroneConnect, UAVs and nodes move in a bounded 2D area of size 100 × 100 with timestep Δt = 0.1 s and episode length T = 200 steps. Each UAV senses entities within radius r_s and communicates within radius r_c (RC); in UC we allow all-to-all messaging. We train using a CTDE actor–critic (PPO-style) with a centralized critic and decentralized actors for 2 × 10^6 environment steps, and we evaluate over 50 episodes. Unless otherwise stated, we average results over 3 random seeds. For communication, each UAV transmits a d = 64-dimensional message embedding to each neighbor per timestep; thus the per-step communication cost scales with the average degree of the RC graph.
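The degree-scaling of the communication cost can be made concrete with a small back-of-the-envelope sketch. The accounting below is our own illustration (d = 64 matches the setup above; one float per embedding dimension per directed in-range link is an assumption, not a figure from the paper).

```python
import numpy as np

def per_step_message_floats(positions, r_c, d=64):
    """Rough per-timestep communication load under RC: each UAV sends a
    d-dimensional embedding to every neighbor within r_c, so the total number
    of transmitted floats is d times the number of directed RC edges."""
    M = len(positions)
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    directed_edges = int((dist <= r_c).sum() - M)  # exclude the M self-pairs
    return directed_edges * d
```

Dividing this by M gives d times the average degree, which is the scaling stated above.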
B. DroneConnect Results

We use M to represent the number of UAVs and N to represent the number of ground nodes to cover. The evaluation metric is the average coverage ratio over an episode, defined as the number of covered nodes divided by the total number of nodes.

Fig. 2. Average coverage per timestep as the number of UAVs (M) and nodes (N) vary (FO+UC setting).

TABLE III: DroneConnect coverage results (mean ± std over 3 seeds). UC/RC: unrestricted/restricted communication; FO/PO: full/partial observability.

Method | M | N | Comm | Obs | Coverage Ratio
Ours (CTDE, dual-attention) | 3 | 6 | UC | FO | 0.76 ± 0.01
Ours (CTDE, dual-attention) | 3 | 6 | RC | FO | 0.74 ± 0.02
Ours (CTDE, dual-attention) | 3 | 6 | UC | PO | 0.72 ± 0.02
Ours (CTDE, dual-attention) | 3 | 6 | RC | PO | 0.71 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | UC | FO | 0.79 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | RC | FO | 0.77 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | UC | PO | 0.76 ± 0.02
Ours (CTDE, dual-attention) | 5 | 10 | RC | PO | 0.74 ± 0.02
Ablation: no inter-communication | 5 | 10 | RC | PO | 0.65 ± 0.03
Ablation: no entity attention | 5 | 10 | RC | PO | 0.63 ± 0.03
Static MILP upper bound | 3 | 6 | – | – | 0.77
Static MILP upper bound | 5 | 10 | – | – | 0.80
Centralized single-agent RL | 3 | 6 | – | FO | 0.75 ± 0.02
Centralized single-agent RL | 5 | 10 | – | FO | 0.79 ± 0.02

1) Coverage Results and Ablations: Table III summarizes coverage under different observability and communication constraints. Overall, our CTDE approach maintains strong coverage under partial observation and restricted communication, while remaining competitive with the static MILP upper bound. We also report two minimal ablations: disabling inter-UAV communication and replacing entity attention with mean pooling; both degrade performance in the challenging RC+PO (restricted communication and partial observability) setting.
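The team reward of Eq. (20) and the average coverage ratio used as the evaluation metric can both be written as short vectorized computations. The sketch below is illustrative; the default λ values are placeholders, not the trained settings.

```python
import numpy as np

def droneconnect_reward(P, U, r_cov, lam_cov=1.0, lam_dist=0.1):
    """Team reward of Eq. (20): coverage fraction minus normalized mean
    distance. P: UAV positions (M, 2); U: node positions (N, 2).
    lam_cov/lam_dist defaults are illustrative placeholders."""
    d = np.linalg.norm(U[:, None, :] - P[None, :, :], axis=-1)  # (N, M)
    d_j = d.min(axis=1)           # distance of each node to its nearest UAV
    c_j = (d_j <= r_cov)          # coverage indicators c_j(t)
    return lam_cov * c_j.mean() - lam_dist * (d_j / r_cov).mean()

def average_coverage_ratio(P_traj, U_traj, r_cov):
    """Evaluation metric: covered nodes / total nodes, averaged over the
    episode. P_traj: (T, M, 2) UAV trajectories; U_traj: (T, N, 2) nodes."""
    ratios = [
        (np.linalg.norm(U[:, None, :] - P[None, :, :], axis=-1)
           .min(axis=1) <= r_cov).mean()
        for P, U in zip(P_traj, U_traj)
    ]
    return float(np.mean(ratios))
```

Note the distance term in the reward also penalizes nodes that are already covered, which pushes UAVs toward the centers of the node clusters they serve rather than their edges.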
2) Zero-Shot Policy Generalization: We evaluate zero-shot generalization by applying a policy trained with M = 5 UAVs directly to scenarios with different team sizes (RC+PO), without fine-tuning. Table IV reports representative coverage ratios. Overall, performance remains stable across team sizes, suggesting that the graph-based representation helps the policy generalize across varying numbers of agents.

TABLE IV: Coverage results with zero-shot transfer in DroneConnect.

Num drones | Num nodes | Comm | Obs | Coverage Ratio
M − 2 | 6 | RC | PO | 0.70
M − 1 | 6 | RC | PO | 0.72
M = 5 | 10 | RC | PO | 0.74
M + 1 | 10 | RC | PO | 0.80
M + 2 | 10 | RC | PO | 0.82

3) Qualitative Coordination and Overlap: In addition to coverage ratio, we quantify coordination by measuring the coverage overlap rate, defined as the fraction of covered nodes that are simultaneously within r_cov of more than one UAV. In the challenging RC+PO setting with M = 5, N = 10, our learned policy achieves a low overlap rate of 0.12, compared to 0.19 for the no-communication ablation, indicating better division of coverage responsibilities. Figure 3 illustrates representative trajectories and communication links under different team sizes.

Fig. 3. Representative DroneConnect snapshots showing division of coverage tasks: (a) M = 3, N = 6; (b) M = 3, N = 10; (c) M = 4, N = 8; (d) M = 5, N = 10. Red links indicate in-range (RC) inter-UAV communication.

C. DroneCombat Results

We briefly report results on the DroneCombat task (Section IV-C) to demonstrate that the same CTDE graph-based architecture can be applied to a mixed cooperative–competitive environment, and we leave a detailed study to future work. We evaluate a 5-vs-5 setting and compare against simple baselines.

TABLE V: DroneCombat results (mean ± std over 3 seeds).

Method | 5v5 win rate | 5v5 avg. episode steps
Ours (CTDE, dual-attention) | 0.62 ± 0.05 | 140 ± 18
No-communication ablation | 0.49 ± 0.04 | 168 ± 22
Independent agents (no CTDE) | 0.42 ± 0.03 | 175 ± 25

Fig. 4. Sample progression of a 5-vs-5 DroneCombat episode (attackers in pink, defenders in cyan). Eliminated drones are shown in darker colors.

VI. CONCLUSION

We presented a centralized training with decentralized execution (CTDE) multi-agent reinforcement learning framework for cooperative UAV deployment under partial observability and communication constraints. Our method represents the environment as an agent–entity graph and uses dual attention: agent–entity attention for local environment embedding and neighbor self-attention for inter-UAV message aggregation. During execution, each UAV runs a decentralized policy using only local observations and peer-to-peer messages, without any centralized coordinator.

In the cooperative DroneConnect task, our approach achieves high coverage under restricted communication and partial observability while remaining competitive with the static MILP upper bound. We also showed that the learned policy can generalize zero-shot to different team sizes in DroneConnect. Finally, we included a mixed cooperative–competitive DroneCombat scenario to illustrate that the same architecture can be applied beyond purely cooperative settings. We leave detailed wireless QoS models and explicit analysis of communication-cost constraints for future work.

REFERENCES

[1] S. Yin, Z. Qu, and L. Li, "Uplink resource allocation in cellular networks with energy-constrained UAV relay," in 2018 IEEE 87th Vehicular Technology Conference (VTC Spring), pp. 1–5, IEEE, 2018.
[2] E. Fan, A. Peng, M. Caesar, J. Kim, J. Eckhardt, G. Kimberly, and D. Osipychev, "Towards effective swarm-based GPS spoofing detection in disadvantaged platforms," in MILCOM 2023 - 2023 IEEE Military Communications Conference (MILCOM), pp. 722–728, 2023.
[3] M. Sobouti, R. Mahapatra, and M. A. Rahman, "Utilizing UAVs in wireless networks: Advantages, challenges, objectives, and solution methods," Vehicles, vol. 6, no. 3, pp. 764–789, 2024.
[4] D. Chauhan, A. Unnikrishnan, and M. Figliozzi, "Maximum coverage capacitated facility location problem with range constrained drones," Transportation Research Part C: Emerging Technologies, vol. 99, pp. 1–18, 2019.
[5] J. Hu, H. Zhang, and L. Song, "Reinforcement learning for decentralized trajectory design in cellular UAV networks with sense-and-send protocol," IEEE Internet of Things Journal, vol. 6, no. 4, pp. 6177–6189, 2019.
[6] C. Wang, J. Wang, Y. Shen, and X. Zhang, "Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach," IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 2124–2136, 2019.
[7] I. Lee, V. Babu, M. Caesar, and D. Nicol, "Deep reinforcement learning for UAV-assisted emergency response," in MobiQuitous 2020 - 17th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, MobiQuitous '20, (New York, NY, USA), pp. 327–336, Association for Computing Machinery, 2021.
[8] S. Kaviani, B. Ryu, E. Ahmed, K. Larson, A. Le, A. Yahja, and J. H. Kim, "DeepCQ+: Robust and scalable routing with multi-agent deep reinforcement learning for highly dynamic networks," in MILCOM 2021 - 2021 IEEE Military Communications Conference (MILCOM), pp. 31–36, 2021.
[9] K. D. Julian and M. J. Kochenderfer, "Distributed wildfire surveillance with autonomous aircraft using deep reinforcement learning," Journal of Guidance, Control, and Dynamics, vol. 42, no. 8, pp. 1768–1778, 2019.
[10] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, "Multiagent cooperation and competition with deep reinforcement learning," PLOS ONE, vol. 12, pp. 1–15, 2017.
[11] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI'18, AAAI Press, 2018.
[12] P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, and J. Wang, "Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games," arXiv preprint arXiv:1703.10069, 2017.
[13] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," Advances in Neural Information Processing Systems, vol. 30, 2017.
[14] S. Sukhbaatar, R. Fergus, et al., "Learning multiagent communication with backpropagation," Advances in Neural Information Processing Systems, vol. 29, 2016.
[15] Y. Hoshen, "VAIN: Attentional multi-agent predictive modeling," Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau, "TarMAC: Targeted multi-agent communication," in International Conference on Machine Learning, pp. 1538–1546, PMLR, 2019.
[17] J. Jiang, C. Dun, T. Huang, and Z. Lu, "Graph convolutional reinforcement learning," arXiv preprint, 2018.
[18] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning, ICML '17, pp. 1263–1272, JMLR.org, 2017.
[19] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[20] A. Agarwal, S. Kumar, K. Sycara, and M. Lewis, "Learning transferable cooperative behavior in multi-agent teams," in Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '20, pp. 1741–1743, International Foundation for Autonomous Agents and Multiagent Systems, 2020.
[21] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch, "Emergent tool use from multi-agent autocurricula," in International Conference on Learning Representations, 2019.
