📝 Original Info
- ArXiv ID: 2512.20624
- Authors: Unknown
📝 Abstract
This study introduces a quantum-inspired framework for optimizing the exploration-exploitation tradeoff in multi-agent reinforcement learning (MARL), applied to UAV-assisted 6G network deployment. We consider a cooperative scenario where ten intelligent UAVs autonomously coordinate to maximize signal coverage and support efficient network expansion under partial observability and dynamic conditions. The proposed approach integrates classical MARL algorithms with quantum-inspired optimization techniques, leveraging variational quantum circuits (VQCs) as the core structure and employing the Quantum Approximate Optimization Algorithm (QAOA) as a representative VQC-based method for combinatorial optimization. Complementary probabilistic modeling is incorporated through Bayesian inference, Gaussian processes, and variational inference to capture latent environmental dynamics. A centralized training with decentralized execution (CTDE) paradigm is adopted, where shared memory and local view grids enhance local observability among agents. Comprehensive experiments, including scalability tests, sensitivity analysis, and comparisons with PPO and DDPG baselines, demonstrate that the proposed framework improves sample efficiency, accelerates convergence, and enhances coverage performance while maintaining robustness. Radar chart and convergence analyses further show that QI-MARL achieves a superior balance between exploration and exploitation compared to classical methods. All implementation code and supplementary materials are publicly available on GitHub to ensure reproducibility.
📄 Full Content
However, UAV-assisted 6G network expansion introduces significant operational challenges, particularly when the environment is dynamic, uncertain, and only partially observable. UAVs must explore unknown terrain while simultaneously exploiting learned information to maintain optimal signal coverage. This classical exploration-exploitation tradeoff becomes increasingly difficult in large, non-stationary environments where signal strength varies with terrain, interference, and user mobility patterns.
Furthermore, multi-agent coordination is essential, as the UAVs operate collaboratively to reduce redundancy and ensure global coverage.
Multi-Agent Reinforcement Learning (MARL) frameworks offer a principled way for decentralized agents to learn collaborative policies under uncertainty. Yet, traditional MARL algorithms often struggle to adapt to rapidly changing environments or to make globally coherent decisions under partial observability. Moreover, solving non-convex reward landscapes in real time imposes computational burdens that limit the scalability of purely classical methods.
To address these limitations, we propose a hybrid framework that integrates MARL with quantum-inspired optimization techniques. Specifically, we simulate the Quantum Approximate Optimization Algorithm (QAOA) to guide policy improvement in non-convex, stochastic environments Farhi et al. (2014a). Additionally, we embed Bayesian signal modeling via Gaussian Processes (GPs) to quantify spatial uncertainty and to inform exploration strategies through an upper confidence bound (UCB) approach Srinivas et al. (2010). The contributions of this work are as follows. First, we develop a quantum-inspired multi-agent reinforcement learning framework that addresses the exploration-exploitation tradeoff in UAV-assisted 6G deployment under dynamic environmental conditions and partial observability. Second, we integrate QAOA simulation with Bayesian Gaussian Process modeling inside a MARL architecture that follows the centralized training and decentralized execution (CTDE) paradigm for computational efficiency and scalability. Third, we design a coupled uncertainty-driven reward shaping mechanism that encourages effective exploration of the signal space by individual UAV agents while fostering cooperative learning among agents operating in the same domain. Finally, we evaluate the framework on dynamic two-dimensional signal coverage simulations with detailed visual performance metrics, systematic ablation studies of individual components, and a computational complexity analysis demonstrating the practical viability of the proposed approach.
This section provides the theoretical foundations for the key components of our proposed framework: Multi-Agent Reinforcement Learning (MARL), Centralized Training with Decentralized Execution (CTDE), Bayesian Optimization with Gaussian Processes, and the Quantum Approximate Optimization Algorithm (QAOA).
In a typical MARL setup, multiple agents interact with a stochastic environment, modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) defined by the tuple (S, {A_i}_{i=1}^N, T, O, {R_i}_{i=1}^N, γ), where S is the set of environment states, A_i the action space of agent i, T the state transition function, O the observation space, R_i the reward function, and γ ∈ [0, 1) the discount factor Oliehoek and Amato (2016). Each agent learns a policy π_i(a_i | o_i) based on its local observations o_i ∈ O. Cooperation is essential in order to optimize a global reward function under partial observability and non-stationarity induced by other learning agents.
CTDE is a widely adopted paradigm in MARL that allows agents to be trained in a centralized manner-leveraging access to global state or joint observations-while maintaining decentralized execution during deployment Foerster et al. (2018). Centralized critics and decentralized actors are trained jointly to stabilize learning and improve coordination. This design facilitates practical deployment in scenarios where communication between agents is limited or intermittent.
Bayesian Optimization (BO) is a global optimization framework for expensive, black-box functions. Gaussian Processes (GPs) are non-parametric Bayesian models that define a distribution over functions, specified by a mean function m(x) and a covariance function k(x, x′) Rasmussen and Williams (2006). The GP posterior allows closed-form estimation of uncertainty, which can be exploited using acquisition functions such as the Upper Confidence Bound (UCB) or Expected Improvement (EI) to guide the exploration-exploitation tradeoff Srinivas et al. (2010).
QAOA is a hybrid quantum-classical algorithm for solving discrete combinatorial optimization problems Farhi et al. (2014a). It alternates between quantum operators based on a problem Hamiltonian H C and a mixing Hamiltonian H M , parameterized by angles γ, β. The state is evolved through a quantum circuit of depth p, and classical optimization updates the parameters to minimize the expected cost. Although QAOA is inherently quantum, recent work demonstrates that classical simulations of QAOA can offer advantages in certain high-dimensional non-convex problems, making it an attractive tool in optimization-aware learning systems.
Unlike prior works that primarily apply variational quantum algorithms (VQAs) such as the Quantum Approximate Optimization Algorithm (QAOA) directly to standalone classical optimization problems Farhi et al. (2014b); Cerezo et al. (2021); McClean et al. (2018), our approach differs in scope and application. Specifically, we embed quantum-inspired optimization outputs into the multi-agent reinforcement learning (MARL) framework, allowing them to directly shape the agents’ action-value functions and deployment strategies. This hybrid integration provides not only a methodological bridge between classical MARL and quantum-inspired optimization but also opens up new possibilities for forward-looking applications, including quantum sensing and estimation Liu et al. (2022); Wang et al. (2023). In doing so, the proposed framework situates itself within ongoing VQA research while extending its utility to complex, partially observable, multi-agent environments that are central to 6G and beyond.
This section presents a comprehensive review of the state-of-the-art research in UAV-assisted 6G network planning, exploration-exploitation strategies in Multi-Agent Reinforcement Learning (MARL), quantum-inspired optimization approaches in AI, and highlights gaps that motivate our work. Moreover, multi-UAV cooperative deployment strategies have been developed to tackle coverage overlap and interference issues. In Wang et al. (2021), a decentralized MARL method allowed UAVs to coordinate in real-time for balanced load distribution and interference mitigation. These works collectively establish the significance of dynamic, cooperative UAV strategies for effective 6G network expansion.
The exploration-exploitation tradeoff remains a critical challenge in MARL due to the exponential growth of joint state-action spaces and the non-stationarity caused by concurrent learning agents Zhang et al. (2021); Taghavi and Farnoosh (2025a).
Recent advances propose novel intrinsic reward mechanisms and curiosity-driven exploration methods that encourage agents to discover new strategies without sacrificing exploitation of known policies Zhang et al. (2022).
Counterfactual Multi-Agent Policy Gradients (COMA) Foerster et al. (2018) introduced a centralized critic to estimate individual agent contributions, enhancing coordinated learning and stable policy updates under partial observability. Building on this, Iqbal et al. (2020) proposed a communication-efficient MARL algorithm that balances exploration and exploitation by learning when to share information among agents.
Bayesian optimization techniques incorporating Gaussian processes have also been adapted to MARL to systematically handle uncertainty and guide exploration Fujimoto et al. (2021). These probabilistic models provide principled frameworks for exploration decisions, though scalability to large agent populations remains an open problem.
Quantum computing and quantum-inspired algorithms, quantum AI/ML Jerbi et al. (2021); Lloyd (2021), Quantum MARL (Taghavi and Farnoosh (2025b)), and quantum optimization Venturelli and Kondratyev (2019); Taghavi and Vahidi (2025), have recently influenced AI and optimization domains by offering novel computational paradigms capable of addressing combinatorial problems more efficiently Acampora and Vitiello (2025); Schuld et al. (2021); Jerbi et al. (2023). The Quantum Approximate Optimization Algorithm (QAOA) Farhi et al. (2014a) has been extensively studied as a hybrid quantum-classical approach to solve NP-hard optimization problems.
In the context of network optimization, quantum-inspired classical algorithms, which mimic quantum annealing or variational circuits, have been developed to harness quantum advantages without requiring quantum hardware Li et al. (2021). Xu and Yang (2022) proposed hybrid quantum-classical frameworks for MARL, combining quantum optimization modules for exploration-exploitation control within classical agent policies, revealing potential performance improvements in partially observable and high-dimensional environments.
Recent work has explored a variety of parameterized quantum model designs and quantum-inspired classification/encoding schemes. For example, Ding et al. (2024b) introduces encoding schemes designed to be more noise-resilient, particularly under parallel computing constraints. In Ding et al. (2024a), the authors propose multicategory classifiers inspired by brain information processing. Other works such as Ding et al. (2024) focus on scaling parameterized circuit depth and width for classification tasks. Generative architectures have also been investigated, as in Ding et al. (2024), which explores intelligent generative models for quantum neural networks. Furthermore, Ding et al. (2025) advances new parameterized quantum gate constructions and efficient gradient methods for variational quantum classification.
While these methods significantly advance parameterized quantum circuit design, classification, and encoding, they typically address supervised learning or classical data tasks. In contrast, our contribution lies in integrating a QAOA-based variational quantum circuit approach with multi-agent reinforcement learning under uncertainty, using Gaussian process modeling, and applying this to the domain of UAV-assisted 6G deployment. This differentiates our work by combining decision-making over spatiotemporal environments, decentralized execution, and exploration-exploitation balancing informed by both uncertainty quantification and quantum-inspired optimization.
Despite these advances, the integration of quantum-inspired optimization within MARL for UAV-assisted 6G network expansion under realistic, dynamic, and partially observable environments remains under-explored. Current MARL methods often lack the computational efficiency and scalability required for real-time UAV network control in non-stationary settings Zhang et al. (2023).
Furthermore, quantum-inspired methods are rarely combined with multi-agent frameworks, and their application to network coverage problems is in early exploratory stages, typically limited to isolated optimization tasks rather than end-to-end learning frameworks Wang et al. (2023).
Our work addresses these gaps by proposing a comprehensive framework that integrates centralized training with decentralized execution (CTDE) MARL, enhanced by quantum-inspired optimization via QAOA simulation, applied to the cooperative deployment of UAVs for 6G network expansion in partially observable dynamic environments. This approach seeks to improve exploration-exploitation balance, coordination, and scalability in complex multi-agent wireless network scenarios.
We consider a dynamic 6G signal propagation environment over a spatial domain D ⊂ R² and discrete time horizon t ∈ {0, 1, . . . , T}. The signal field is modeled as a spatio-temporal stochastic process Φ(x, y, t) representing the signal strength at location (x, y) and time t.
Let Φ : D × {0, . . . , T} → R denote a random field capturing signal intensity values.
We assume:
$$\Phi(x, y, t) \sim \mathcal{GP}\big(\mu(x, y, t),\; k((x, y, t), (x', y', t'))\big), \qquad (4.1)$$
where GP is a Gaussian process with mean function µ and covariance kernel k. The signal field evolves according to environmental dynamics (e.g., interference, obstacles), and is only partially observable.
Let N = {1, . . . , N } denote the set of UAV agents. Each agent i ∈ N has:
• A state space S_i, representing its physical position and internal states.
• An action space A_i, including movement directions and transmission parameters.

Local Signal Grid: Each UAV observes a local signal grid centered around its current position. The grid is represented as a 7 × 7 patch of discretized spatial cells, each cell encoding three features: (i) the estimated signal strength µ(x, y) from the Gaussian Process posterior mean, (ii) the signal uncertainty σ(x, y), and (iii) the number of neighboring UAVs within that cell (a proxy for interference and coordination). This compact representation ensures partial observability while retaining spatially relevant information for deployment decisions.
Action Space: The action space is discrete, defined as A = {North, South, East, West, Stay} × {Increase Power, Decrease Power, Hold Power}.
Thus, each UAV simultaneously chooses a movement action and a transmission adjustment. This design balances mobility and communication optimization while keeping the action set tractable for multi-agent learning.
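For concreteness, the composite per-UAV action set can be enumerated as a Cartesian product. The short Python sketch below assumes a flat enumeration; the names MOVES, POWER, and ACTIONS are illustrative and not part of the original implementation.

```python
from itertools import product

# Illustrative flat enumeration of the per-UAV discrete action set described above:
# 5 movement actions x 3 transmission adjustments = 15 composite actions.
MOVES = ["North", "South", "East", "West", "Stay"]
POWER = ["Increase Power", "Decrease Power", "Hold Power"]
ACTIONS = list(product(MOVES, POWER))   # e.g. ("North", "Hold Power")
assert len(ACTIONS) == 15
```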
Transmission Parameters: Each UAV controls two transmission-level parameters: (i) transmit power P_i in the range [10, 30] dBm, discretized into three levels (low, medium, high), and (ii) beamwidth θ_i chosen from {30
The proposed multi-agent framework operates under partial observability conditions, wherein each agent maintains access to limited environmental information through carefully structured observation mechanisms. Local observations o_i^t are systematically defined over a spatially constrained grid G_i^t that remains centered at the position of agent i, thereby ensuring computational tractability while preserving essential spatial awareness for decision-making processes.
Our goal is to optimize the joint policy of all UAV agents in order to maximize the overall 6G network coverage, minimize interference, and ensure efficient power usage under partial observability and environmental uncertainty. Formally, let π = {π_i}_{i∈N} denote the set of decentralized policies, where each π_i : O_i → Δ(A_i) maps observations to a probability distribution over actions. The optimization objective is defined as the maximization of the expected cumulative reward:

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t}\, R(s_t, a_t)\right],$$

where γ ∈ (0, 1] is a discount factor and R(s_t, a_t) is the global reward function.
Reward Design: The reward integrates three competing objectives:

$$R(s_t, a_t) = \alpha\,\mathrm{Coverage}_t - \beta\,\mathrm{Interference}_t - \eta\,\mathrm{Power}_t,$$

where α, β, η > 0 are weighting coefficients. Coverage is quantified as the proportion of the domain D achieving signal strength above a threshold, interference measures the overlap and signal collisions among UAVs, and power accounts for the transmit energy expenditure.
Optimization Goal: Thus, the problem reduces to finding decentralized policies π_i that jointly maximize network utility:

$$\pi^{\star} = \arg\max_{\pi}\; J(\pi),$$

subject to the dynamics of the stochastic field Φ(x, y, t), agent mobility constraints, and communication bandwidth limitations.
5 Proposed Method
Figure 1 illustrates the end-to-end pipeline of the proposed QI-MARL framework.
The workflow begins with Gaussian Process modeling of the spatio-temporal signal field, which produces both a posterior mean and an uncertainty estimate. These are translated into a cost Hamiltonian that guides QAOA-based optimization, yielding configuration vectors z⃗_i^t for each agent. The optimized configurations are then embedded into the policy networks, shaping action selection under partial observability. This layered design demonstrates a clear trend: probabilistic modeling informs quantum-inspired optimization, which in turn drives reinforcement learning. The implication is that exploration and exploitation are not handled separately but are tightly coupled through the GP-QAOA interface. Compared with conventional MARL pipelines, which typically rely on handcrafted exploration bonuses or entropy regularization, our approach integrates uncertainty and quantum optimization directly into policy updates, providing a more adaptive mechanism for coordinated UAV deployment.
To model the 6G signal field Φ(x, y, t) over a spatio-temporal domain, we use Gaussian Process (GP) regression:

$$\Phi(x, y, t) \sim \mathcal{GP}\big(m(s),\; k(s, s')\big), \qquad (5.1)$$

where s = (x, y, t) denotes the spatio-temporal input.
For spatial components, we use either the Radial Basis Function (RBF) kernel or the Matérn kernel (Williams and Rasmussen (2006)):
• RBF kernel (Squared Exponential):
$$k_{\mathrm{RBF}}(s, s') = \sigma^2 \exp\!\left(-\frac{\lVert s - s'\rVert^2}{2\ell^2}\right),$$
where ℓ is the length-scale and σ² is the signal variance.
• Matérn kernel (general form):
$$k_{\mathrm{Matern}}(s, s') = \sigma^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\lVert s - s'\rVert}{\ell}\right)^{\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}\,\lVert s - s'\rVert}{\ell}\right),$$
where ν > 0 controls smoothness, Γ(·) is the gamma function, and K_ν(·) is the modified Bessel function of the second kind.

For temporal dependencies, we use either a periodic kernel
$$k_{\mathrm{per}}(t, t') = \sigma_t^2 \exp\!\left(-\frac{2\sin^2\!\big(\pi\,|t - t'|/p_{\mathrm{per}}\big)}{\ell_t^2}\right)$$
or a squared exponential kernel.
The GP posterior mean µ(s) and variance σ 2 (s) guide both the reward estimation and the uncertainty-aware decision making in MARL.
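To illustrate how the GP posterior can drive UCB-style site selection, the following minimal sketch uses scikit-learn's GP regressor with a Matérn kernel on toy RSSI samples. The library choice, the data values, and the candidate grid are assumptions for demonstration only, not the paper's actual implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

# Toy signal-field samples: rows are (x, y) locations, values are measured RSSI (dBm).
X_obs = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0], [2.0, 3.0]])
y_obs = np.array([-60.0, -55.0, -70.0, -58.0])

# Spatial kernel: Matern(nu=1.5); an RBF kernel could be substituted equivalently.
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, normalize_y=True)
gp.fit(X_obs, y_obs)

# Posterior mean/std over candidate grid cells, then UCB acquisition (cf. Eq. 5.14).
candidates = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
mu, sigma = gp.predict(candidates, return_std=True)
kappa = 1.0
ucb = mu + kappa * sigma                 # exploration bonus from GP uncertainty
best = candidates[np.argmax(ucb)]        # next location to visit/measure
```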
To incorporate quantum principles, we formulate the agent reward landscape as a cost Hamiltonian H_C over a binary decision space and employ a classical simulation of the Quantum Approximate Optimization Algorithm (QAOA) to find optimal deployment actions:

$$H_C = -\sum_{i \in \mathcal{U}} w_i z_i,$$

where U = {1, . . . , N} denotes the set of UAV agents, z_i ∈ {0, 1} indicates whether UAV i selects a given candidate position (1 = selected, 0 = not selected), and w_i represents the estimated signal quality at that location.
The QAOA algorithm minimizes ⟨ψ(γ, β)|H_C|ψ(γ, β)⟩ via the variational angles (γ, β), where the variational state is prepared as

$$|\psi(\gamma, \beta)\rangle = \prod_{l=1}^{p} e^{-i\beta_l H_M}\, e^{-i\gamma_l H_C}\, |+\rangle^{\otimes N},$$

where H_M is a mixing Hamiltonian (typically H_M = Σ_i X_i). We simulate QAOA using Qiskit.
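While the paper's implementation simulates QAOA with Qiskit, the following self-contained NumPy/SciPy sketch emulates the same variational loop for a diagonal single-qubit-Z cost Hamiltonian (H_C = −Σ_j w_j z_j) with hypothetical weights. It is a minimal statevector emulation under those assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def qaoa_expectation(params, costs, n, p):
    """Classical statevector simulation of depth-p QAOA for a diagonal cost Hamiltonian.
    costs[z] is the cost C(z) of basis state z; H_M = sum_i X_i is the standard mixer."""
    gammas, betas = params[:p], params[p:]
    psi = np.full(2 ** n, 1 / np.sqrt(2 ** n), dtype=complex)  # |+>^n initial state
    for gamma, beta in zip(gammas, betas):
        psi = psi * np.exp(-1j * gamma * costs)                # e^{-i gamma H_C} (diagonal)
        rx = np.array([[np.cos(beta), -1j * np.sin(beta)],
                       [-1j * np.sin(beta), np.cos(beta)]])    # e^{-i beta X}
        psi = psi.reshape((2,) * n)
        for q in range(n):                                     # mixer acts qubit-by-qubit
            psi = np.moveaxis(np.tensordot(rx, np.moveaxis(psi, q, 0), axes=1), 0, q)
        psi = psi.reshape(-1)
    return float(np.real(np.vdot(psi, costs * psi)))           # <psi| H_C |psi>

# Example: hypothetical GP-derived weights, with H_C(z) = -sum_j w_j z_j.
w = np.array([0.8, 0.3, 0.6, 0.9])
n, p = len(w), 2
bits = (np.arange(2 ** n)[:, None] >> np.arange(n)) & 1        # all bitstrings
costs = -(bits * w).sum(axis=1)

res = minimize(qaoa_expectation, x0=0.1 * np.ones(2 * p),
               args=(costs, n, p), method="COBYLA")
print("optimized <H_C> =", res.fun)
```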
The connection between QAOA and reinforcement learning is established by mapping the measurement outcomes of the optimized QAOA circuit into probability distributions over candidate actions.
Specifically:
- After convergence, the optimized state |ψ(γ*, β*)⟩ is sampled multiple times to obtain a distribution P(z) over binary deployment vectors z = (z_1, . . . , z_n).
- Each sampled z corresponds to a feasible joint action profile for the UAVs. The empirical probability P(z) is interpreted as a quantum-inspired prior over high-quality actions.
- For each agent i, we define a QAOA-informed action preference
$$P_i^{Q}(a_i \mid o_i) = \sum_{z\,:\, z_i \,\equiv\, a_i} P(z),$$
which represents the marginal likelihood of agent i selecting action a_i given its local observation o_i.
- This QAOA-informed prior is then integrated with the classical action-value estimate Q_i^{RL}(o_i, a_i) from the reinforcement learning update via a convex combination
$$\tilde{Q}_i(o_i, a_i) = \lambda\, P_i^{Q}(a_i \mid o_i) + (1 - \lambda)\, Q_i^{RL}(o_i, a_i),$$
where λ ∈ [0, 1] is a tunable parameter controlling the influence of quantum-inspired optimization on policy updates (a minimal sketch of this sampling-and-blending step is given below).
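The following sketch illustrates the sampling-and-blending step described above. The function names and the way the statevector `psi` is obtained (e.g., from the emulator sketched earlier) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def qaoa_action_prior(psi, n, n_samples=2000):
    """Sample bitstrings from an optimized QAOA statevector `psi` over n binary
    decisions and return the marginal P(z_i = 1) for each agent/candidate."""
    probs = np.abs(psi) ** 2
    probs = probs / probs.sum()                      # guard against float drift
    samples = rng.choice(len(probs), size=n_samples, p=probs)
    bits = (samples[:, None] >> np.arange(n)) & 1    # little-endian bit extraction
    return bits.mean(axis=0)

def blended_value(q_rl, p_q, lam=0.3):
    """Convex combination of the classical action-value estimate and the
    QAOA-informed prior (the lambda-weighted blend described above)."""
    return lam * np.asarray(p_q) + (1.0 - lam) * np.asarray(q_rl)
```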
Practical Implementation: In practice, the QAOA component acts as a structured exploration mechanism, biasing agents toward action profiles that are globally consistent with the optimized cost Hamiltonian, while the reinforcement learning update ensures adaptation to the stochastic 6G environment. This hybridization enables decentralized agents to balance local learning with globally coordinated optimization. As such, it is well-recognized that deploying VQAs directly on classical optimization tasks does not by itself constitute a novel contribution.
The distinct contribution of our work lies in embedding VQA-derived policy components into a multi-agent reinforcement learning (MARL) framework. Rather than treating QAOA as a standalone optimizer, we use its outputs to enrich the agents’ action-value functions within a decentralized yet coordinated setting for UAV-assisted 6G deployment. To the best of our knowledge, this represents one of the first systematic integrations of VQAs with MARL under partial observability and uncertainty, bridging quantum-inspired optimization with probabilistic learning (Gaussian process modeling) and decentralized decision-making. This hybridization opens pathways for quantum-inspired MARL methods that can flexibly adapt to dynamic, real-world environments.
Furthermore, this perspective connects naturally to emerging results in quantum sensing and estimation, where reinforcement learning and quantum-inspired policies have shown promise in multiparameter estimation and adaptive measurement design Liu et al. (2022); Wang et al. (2023). By framing the UAV signal estimation task through Gaussian process regression and uncertainty-aware exploration, our methodology can be seen as a conceptual bridge between quantum-enhanced estimation strategies and real-world sensing applications.
Finally, while our present implementation relies on classical simulation of QAOA, the framework is inherently quantum-classical hybrid and can be mapped to near-term quantum hardware. The variational structure of QAOA allows for scalable deployment on NISQ devices, and the MARL integration ensures that quantum-derived components are utilized where they are most impactful-optimizing exploration-exploitation balance in high-dimensional decision spaces. We thus view this work as a step toward interdisciplinary advances at the intersection of quantum machine learning, reinforcement learning, and applied network optimization.
We provide (i) an explicit discretization and mapping from the continuous GP reward field to a QAOA cost Hamiltonian, (ii) two principled mechanisms to inject QAOA-derived solutions into policy/value updates, and (iii) a quantitative characterization of the GP variance role in exploration via the UCB term (also see Appendix C).
- From GP reward field to a QAOA cost Hamiltonian
Let the GP posterior mean and variance at a discrete set of candidate locations (grid cells) X = {x_1, . . . , x_m} be µ(x_j) and σ²(x_j). For each agent (or joint placement choice) we define a finite binary decision vector z ∈ {0, 1}^m, where z_j = 1 indicates selecting (placing / covering) location x_j by that agent (or coalition). We construct a cost Hamiltonian Ĥ_C in Ising/Pauli-Z notation by first mapping continuous rewards to scalar weights:

$$\tilde{w}_j = -\big(\mu(x_j) + \kappa_{\mathrm{map}}\,\sigma(x_j)\big), \qquad (5.10)$$

where κ_map ≥ 0 trades off exploitation (mean) vs. epistemic value (variance) in the QAOA cost. The negative sign converts larger signal/uncertainty into lower cost (QAOA minimizes expected cost). We then normalize w̃_j to a bounded range [w_min, w_max] (e.g., via affine scaling) to obtain weights w_j that are numerically stable for circuit simulation.
Using binary variables z_j ∈ {0, 1}, a common QUBO/Ising-style cost is

$$C(z) = \sum_{j} w_j z_j + \sum_{j<k} w_{jk}\, z_j z_k, \qquad (5.11)$$

where pairwise terms w_jk can encode soft collision/coverage overlap penalties or interference coupling between locations x_j, x_k (set w_jk = 0 if not used). Converting to Pauli-Z operators (Z_j = 1 − 2z_j) yields the operator form Ĥ_C = Σ_j α_j Z_j + Σ_{j<k} β_{jk} Z_j Z_k + const used in the QAOA simulation. Equation (5.10) therefore gives an explicit and reproducible mapping from GP posterior statistics to the QAOA cost Hamiltonian.
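For completeness, substituting z_j = (1 − Z_j)/2 (equivalently Z_j = 1 − 2z_j) into Eq. (5.11) makes the Pauli coefficients explicit; this is the standard QUBO-to-Ising conversion:

$$C(z) = \sum_j w_j z_j + \sum_{j<k} w_{jk}\, z_j z_k = \sum_j \frac{w_j}{2}\,(1 - Z_j) + \sum_{j<k} \frac{w_{jk}}{4}\,(1 - Z_j)(1 - Z_k),$$

so that

$$\alpha_j = -\frac{w_j}{2} - \frac{1}{4}\sum_{k \neq j} w_{jk}, \qquad \beta_{jk} = \frac{w_{jk}}{4}, \qquad \mathrm{const} = \frac{1}{2}\sum_j w_j + \frac{1}{4}\sum_{j<k} w_{jk}.$$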
After simulating QAOA (or its classical emulation) we obtain a (possibly stochastic) candidate configuration ẑ⋆ (either the highest-sampled bitstring or a small set of high-probability bitstrings). We describe two principled mechanisms to use ẑ⋆ inside MARL:
(A) Action-value augmentation (potential-like shaping). Define a shaping potential on joint actions Φ_Q(s, a) that raises the value of actions agreeing with ẑ⋆:

$$\Phi_Q(s, a) = \gamma_Q\, \mathbb{1}\!\left[a \equiv \hat{z}^{\star}\right], \qquad (5.12)$$

where γ_Q ≥ 0 controls the shaping strength and the indicator is 1 when the discrete placement/movement part of a matches ẑ⋆ (or matches it on a sufficient subset). We then use a potential-based reward shaping term

$$F(s_t, a_t, s_{t+1}, a_{t+1}) = \gamma\, \Phi_Q(s_{t+1}, a_{t+1}) - \Phi_Q(s_t, a_t),$$

which preserves the optimal policy under standard results for potential-based shaping when Φ_Q depends only on states or is a difference of potentials across states Ng et al. (1999). We adapt it to action-level shaping carefully and correct for bias when needed (see Remark 1) (Müller and Kudenko (2025)). Concretely, the actor-critic update uses the shaped reward r'_i = r_i + F(s_t, a_t, s_{t+1}, a_{t+1}) instead of r_i in advantage estimation.
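A minimal sketch of mechanism (A) is given below, assuming a user-supplied `placement_of` helper that extracts the discrete placement component of an action; this helper and the default values are illustrative, not part of the original code.

```python
def make_phi_q(z_star, placement_of, gamma_q=0.1):
    """Potential over (state, action): gamma_Q if the discrete placement part of
    the action matches the QAOA proposal z_star, else 0 (cf. Eq. (5.12)).
    `placement_of` maps an action to its placement/movement component."""
    def phi_q(state, action):
        return gamma_q if placement_of(action) == z_star else 0.0
    return phi_q

def shaped_reward(r, s, a, s_next, a_next, phi_q, gamma=0.99):
    # Mechanism (A): r' = r + gamma * Phi_Q(s', a') - Phi_Q(s, a)
    return r + gamma * phi_q(s_next, a_next) - phi_q(s, a)
```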
(B) Policy regularization toward QAOA proposals. Treat the QAOA-derived distribution π_Q(· | s) (obtained by sampling QAOA bitstrings and forming an empirical distribution over candidate discrete actions) as an auxiliary prior and add a regularization / imitation loss to the policy objective. For a policy parameterized by θ we modify the per-step surrogate loss (e.g., PPO) as

$$L(\theta) = L^{\mathrm{PPO}}(\theta) + \lambda_Q\, D_{\mathrm{KL}}\!\big(\pi_Q(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big), \qquad (5.13)$$

where λ_Q ≥ 0 is a tunable coefficient. Minimizing the KL term encourages the learned policy to place probability mass on QAOA-favored discrete placements while still allowing the policy to deviate when environment rewards (through L^PPO) demand it. This approach is robust and reproducible: π_Q is constructed from the QAOA samples and the KL term is evaluated over the discrete action subset influenced by QAOA. In practice we use a clipped surrogate (PPO) plus the KL penalty to preserve stability.
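A hedged PyTorch sketch of mechanism (B) for a discrete action head follows; the tensor shapes, variable names, and clipping constant are illustrative assumptions rather than the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def ppo_loss_with_qaoa_kl(logits, old_log_probs, actions, advantages,
                          pi_q, lambda_q=0.1, clip_eps=0.2):
    """Clipped PPO surrogate plus a KL(pi_Q || pi_theta) penalty (cf. Eq. 5.13).
    `pi_q` is the empirical QAOA proposal distribution over the discrete actions
    (shape (A,) or (batch, A)); `actions` is a LongTensor of taken actions."""
    log_probs = F.log_softmax(logits, dim=-1)
    new_lp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = torch.exp(new_lp - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    ppo = -torch.min(ratio * advantages, clipped * advantages).mean()
    # KL(pi_Q || pi_theta) = sum_a pi_Q(a) * (log pi_Q(a) - log pi_theta(a))
    kl = (pi_q * (torch.log(pi_q + 1e-8) - log_probs)).sum(dim=-1).mean()
    return ppo + lambda_q * kl
```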
Both mechanisms (A) and (B) are compatible and can be combined. Mechanism (A) provides immediate action-level reward guidance, while mechanism (B) regularizes policy updates over training and is particularly helpful when QAOA proposals are noisy but informative.
Our agent selection uses a UCB acquisition form driven by the GP posterior mean µ(·) and standard deviation σ(·):

$$\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x), \qquad (5.14)$$

with κ > 0 controlling exploration intensity. Two quantitative remarks follow:

(i) Selection probability sensitivity. Assume the agent chooses among candidate points x by softmax over UCB scores (or a greedy arg max with random tie-breaking). Under a softmax temperature τ, the probability of selecting x is

$$P(x) = \frac{\exp\!\big(\mathrm{UCB}(x)/\tau\big)}{\sum_{x'} \exp\!\big(\mathrm{UCB}(x')/\tau\big)}.$$

Differentiating the log-probability with respect to κ yields

$$\frac{\partial \ln P(x)}{\partial \kappa} = \frac{1}{\tau}\Big(\sigma(x) - \mathbb{E}_{x' \sim P}\big[\sigma(x')\big]\Big),$$

so increasing κ increases the selection probability for points with above-average posterior standard deviation. This formalizes the intuitive effect of κ: it amplifies preference toward epistemically uncertain locations.
(ii) Implications for regret and information gain. When GP-UCB is used in bandit-style queries, theoretical analyses (e.g., Srinivas et al. (2010)) bound the cumulative regret in terms of the information gain γ_T of the GP kernel.

• Shaping bias: Potential-based shaping that depends only on states preserves optimal policies. Action-level shaping (Eq. (5.12)) can introduce bias; to mitigate this, we anneal γ_Q → 0 over training or employ the KL regularizer (Eq. (5.13)), which is less likely to bias asymptotic optimality.
• Approximate QAOA distributions: QAOA is simulated classically (or sampled on NISQ hardware in future work) and proposals should be treated as noisy priors.
The KL regularizer is robust to noise because the policy is still trained to maximize empirical returns; the regularizer only nudges the policy rather than forcing it.
The derivations above make explicit (a) how to deterministically construct Ĥ_C from GP posterior statistics (Eqs. 5.10-5.11), (b) two reproducible mechanisms for incorporating QAOA outputs into MARL updates (Eqs. 5.12 and 5.13), and (c) the quantitative role of the GP variance and UCB coefficient in modulating exploration (Eq. 5.14 and the selection sensitivity expression). Together these provide the theoretical grounding of the approach and a clear recipe for reproduction.

Mapping the GP reward field to a QAOA cost Hamiltonian and policy-conditioning
The continuous GP posterior (mean µ and variance σ²) defines a scalar reward field R(s) over the spatial domain. To use QAOA (a discrete optimizer) we discretize the domain into a finite set of candidate placement/configuration variables C = {c_1, . . . , c_n} (e.g., a coarse grid of candidate UAV waypoints or configuration indices). Each candidate c_j is associated with a binary decision variable z_j ∈ {0, 1} indicating whether the candidate is selected in the suggested configuration vector z. We map the continuous reward field to Hamiltonian weights w_j via a deterministic, monotone normalization (min-max or softmax) so that larger expected reward corresponds to a larger negative contribution to the QAOA cost (we minimize the Hamiltonian). Optionally, pairwise penalty terms J_ij can be added to H_C to discourage collisions or overlapping coverage.

After QAOA returns a distribution or sample set over z, we compute a marginal prior p_Q(c_j) from the QAOA samples and use it to bias the actor policy: either by adding a small additive bias to the policy logits corresponding to actions that move toward high-p_Q candidates, or by reward shaping (adding a small bonus proportional to p_Q for actions aligned with the QAOA suggestion). This yields a principled and differentiable way to condition the local actor π_i on the globally informed QAOA suggestions while preserving decentralized execution and learning.
Let C = {c_1, . . . , c_n} denote a finite set of candidate configurations (e.g., coarse grid points or configuration indices) chosen from the continuous domain D. For each candidate c_j define the GP-derived utility (reward) at time t:

$$R_t(c_j) = \mu_t(c_j) + \kappa\,\sigma_t(c_j), \qquad (5.15)$$

where µ_t(·), σ_t(·) are the GP posterior mean and standard deviation and κ is the UCB weight used for exploration.
We map the continuous values R_t(c_j) to QAOA Hamiltonian coefficients w_j using a monotone normalization. Two practical choices used in this work are:

(1) Min-max normalization:

$$w_j = \gamma_w\, \frac{R_t(c_j) - \min_k R_t(c_k)}{\max_k R_t(c_k) - \min_k R_t(c_k)}, \qquad (5.16)$$

where γ_w > 0 scales the Hamiltonian strength.

(2) Softmax-normalized weights (probabilistic emphasis):

$$w_j = \gamma_w\, \frac{\exp\!\big(R_t(c_j)/\tau\big)}{\sum_k \exp\!\big(R_t(c_k)/\tau\big)},$$

with temperature τ > 0 controlling concentration (a short implementation sketch of both normalizations follows).
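A short NumPy sketch of the two normalization choices is given below; the guard constants and default arguments are illustrative assumptions.

```python
import numpy as np

def minmax_weights(r, gamma_w=1.0):
    # Eq. (5.16): affine rescaling of the GP-UCB utilities R_t(c_j) to [0, gamma_w]
    r = np.asarray(r, dtype=float)
    span = r.max() - r.min()
    return gamma_w * (r - r.min()) / (span + 1e-12)

def softmax_weights(r, gamma_w=1.0, tau=0.1):
    # Softmax-normalized weights with temperature tau (probabilistic emphasis)
    r = np.asarray(r, dtype=float)
    e = np.exp((r - r.max()) / tau)      # subtract max for numerical stability
    return gamma_w * e / e.sum()
```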
Using the w_j coefficients, the problem Hamiltonian is defined as a (binary) cost Hamiltonian

$$H_C(z) = -\sum_{j} w_j z_j + \sum_{i<j} J_{ij}\, z_i z_j,$$

where z_j ∈ {0, 1} are binary variables representing candidate selection. The first term rewards selecting high-utility candidates (we use the negative sign so QAOA minimisation seeks high-reward placements). The second term encodes pairwise penalties J_ij ≥ 0 to forbid invalid/overlapping configurations (e.g., collision avoidance, distance constraints). Pairwise terms can be set by geometric constraints, e.g.
$$J_{ij} = J_0\, \mathbb{1}\!\left[\,\lVert c_i - c_j \rVert < d_{\min}\,\right], \qquad (5.20)$$

where J_0 > 0 is a penalty magnitude and d_min a minimum separation distance.
Policy conditioning / action bias.
Let the local actor π_i(a | o_i; θ) produce (pre-softmax) logits g_i(a | o_i) for discrete action a. We map candidate configurations to corresponding actions via a deterministic mapping M : A → C (or to the nearest candidate for continuous actions). The QAOA marginal p_Q(c_j) induces an additive bias to the actor logits:

$$\tilde{g}_i(a \mid o_i) = g_i(a \mid o_i) + \eta\, \log\!\big(p_Q(\mathcal{M}(a)) + \epsilon\big), \qquad (5.21)$$

where η ≥ 0 controls the strength of the QAOA prior and ϵ is a small constant for numerical stability. The resulting policy is

$$\pi_i(a \mid o_i) = \frac{\exp\!\big(\tilde{g}_i(a \mid o_i)\big)}{\sum_{a'} \exp\!\big(\tilde{g}_i(a' \mid o_i)\big)}. \qquad (5.22)$$

Alternatively (and equivalently in expectation), QAOA can be used for reward shaping:

$$\tilde{r}_i^{\,t} = r_i^{\,t} + \lambda\, \mathbb{1}\!\left[a_i^t = a_Q\right],$$

where a_Q is the action corresponding to the top suggested candidate and λ is a small bonus factor. In practice we combine the log-prior bias (preferred) with a small reward-shaping bonus to stabilize learning.
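The logit-bias conditioning of Eqs. (5.21)-(5.22) can be sketched as follows; the array shapes and default values are illustrative assumptions rather than the implementation used in the paper.

```python
import numpy as np

def bias_logits_with_qaoa(logits, p_q_of_action, eta=0.3, eps=1e-6):
    """Add eta * log(p_Q + eps) to the actor logits and renormalize (Eqs. 5.21-5.22).
    `p_q_of_action[a]` is the QAOA marginal of the candidate M(a) mapped to action a."""
    g = np.asarray(logits, dtype=float) + eta * np.log(np.asarray(p_q_of_action) + eps)
    g -= g.max()                         # numerical stability before the softmax
    pi = np.exp(g)
    return pi / pi.sum()
```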
You may refer to Appendix A.4 for implementation details.
We implement a Multi-Agent Reinforcement Learning framework under Centralized Training and Decentralized Execution (CTDE). Each UAV agent i maintains a local policy π_i(a_i | o_i), its local observation grid, and access to the shared memory of global signal maps.
The reward r_i is shaped by both the observed signal strength and the GP posterior uncertainty, weighted by the coefficients α and β examined in the sensitivity analysis.
Algorithm Overview:
We combine Bayesian GP modeling, QAOA for global optimization, and MARL for adaptive deployment. The training loop is as follows:
- Sample signal measurements to update the GP posterior.
- Simulate QAOA to suggest high-reward deployment options.
- Train agents using PPO/DDPG with CTDE.

Within each episode, the detailed steps are:
- Compute the posterior signal mean µ_i^t and variance σ_i^t via the GP.
- Normalize the utilities and set the Hamiltonian weights w_j (min-max or softmax), then form H_C.
- Classically simulate QAOA on H_C to obtain samples {z^(s)}_{s=1}^S and compute the marginals p_Q(c_j).
- Convert p_Q(·) into a local policy prior by adding η log p_Q(M(a)) to the actor logits and set the biased policy π_i.
- Execute a_i^t, receive the reward r_i^t, and update the shared memory M_{t+1}.
- Perform centralized training using CTDE to update {π_i}.

Complexity Analysis:
• GP Posterior Update: O(N³) time due to inversion of the kernel matrix (can be reduced via sparse GPs).
• QAOA Simulation: depends on the depth p and number of qubits n, with circuit simulation scaling as O(2^n).
• PPO/DDPG Learning: per-step complexity is O(nL) for n agents and L learning updates per episode.
• Memory:
We conducted experiments to measure the empirical scaling of the QI-MARL framework with increasing UAVs. The results are summarized below:
• Runtime: The average per-episode runtime increases linearly with the number of UAVs, with a slight sub-linear trend observed for larger numbers of UAVs due to the optimization gains from parallelization and shared memory access.
• Memory Usage: Memory consumption grows linearly with the number of agents, primarily due to the increasing size of the experience replay buffer and the need to store additional state information for each UAV (Fig. 2). The memory usage per UAV is relatively constant as the system scales.
• Learning Performance: The learning performance, measured by reward convergence rate, shows a slight degradation as UAVs increase, but the QI-MARL framework still converges effectively within the same number of episodes for all tested UAV configurations (Fig. 3). The increase in number of agents leads to slower convergence, especially in larger configurations (e.g., 50 UAVs), but overall performance remains robust. Together, these results imply that the QI-MARL approach scales favorably in practice, balancing computational overhead with robust performance gains, and thus provides a realistic pathway for deployment in larger UAV networks. The empirical data shows that while runtime and memory usage increase linearly, learning performance remains relatively stable, though slightly impacted by the larger number of UAVs.
To jointly handle exploration and exploitation, we adopt an Upper Confidence Bound (UCB) strategy guided by the GP variance:

$$\mathrm{UCB}_i(x) = \mu_t(x) + \kappa\,\sigma_t(x),$$

where κ controls the exploration strength.
Each agent samples actions from π i (a i |o i ) while maximizing UCB i , thus forming a decentralized policy with centralized learning. Coordination is reinforced by message passing and shared memory of global signal maps.
To evaluate the performance of the proposed quantum-inspired multi-agent reinforcement learning (MARL) framework, we design a comprehensive simulation environment reflecting real-world constraints and uncertainties. The simulation is constructed under a partially observable setting using a centralized training with decentralized execution (CTDE) paradigm, ensuring agents can learn collaboratively while making decisions independently during execution.
The simulated environment comprises a dynamic multi-agent system, where ten intelligent unmanned aerial vehicles (UAVs) are tasked with optimizing 6G network coverage and expansion. Each UAV is equipped with local sensors and limited-range communication capabilities, resulting in partial observability. Shared memory or local view grids are employed to facilitate cooperative behavior among agents.
Key components of the simulation setup include:
• State Space: Encodes local environmental features observable by each UAV, including signal strength, node density, and physical obstructions.
• Action Space: Comprises discrete movement and signal relay actions such as altitude adjustment, directional movement, and frequency channel selection.
• Reward Function: Designed to encourage efficient coverage expansion, minimize redundant overlap, and penalize energy overuse or failed signal transmission.
• Learning Framework: Classical baselines include Q-learning, PPO, and DDPG algorithms, against which the performance of the quantum-inspired models is compared. Variational inference, Gaussian processes, and Monte Carlo sampling are utilized to support uncertainty modeling and exploration-exploitation analysis.
• Quantum Integration: Quantum Approximate Optimization Algorithm (QAOA) is simulated using Qiskit to enhance the decision-making process under uncertainty, enabling non-classical exploration strategies.
Simulations are run for multiple episodes, with each episode comprising a fixed number of timesteps. Performance metrics such as coverage efficiency, convergence speed, and communication load are tracked and analyzed to assess the benefits of incorporating quantum-inspired techniques.
To comprehensively assess the proposed Quantum-Inspired Multi-Agent Reinforcement Learning (QI-MARL) framework for UAV-assisted 6G network deployment, we define a structured set of evaluation metrics. These metrics span signal quality, learning dynamics, exploration-exploitation trade-offs, computational efficiency, inter-agent coordination, and comparative performance with classical MARL baselines.
• Average Signal Quality (µ): Mean received signal strength indicator (RSSI) across all coverage grid cells.
• Signal Quality Variance (σ 2 ): Indicates the consistency of RSSI distribution across the area.
• Signal-to-Noise Ratio (SNR): Average SNR per spatial unit, reflecting communication reliability.
• Coverage Rate: Proportion of the area where signal strength surpasses a defined threshold.
• Dead Zone Ratio: Fraction of the terrain where signal quality is insufficient for service.
• Value Function Variance: Measures inter-agent differences in expected return estimates.
• Exploration Ratio: Ratio of visited to total cells, indicating spatial exploration extent.
• Regret: Difference between the optimal and actual obtained rewards over time.
• GP Posterior Variance: Reflects epistemic uncertainty in agent decisionmaking via Bayesian models.
• Acquisition Function Behavior: Captures informativeness of chosen actions in Gaussian Process-based planning.
• QAOA Runtime: Time required for each quantum-inspired optimization step per episode.
• GP Inference Time: Computational cost of updating and sampling from the Gaussian Process model.
• Scalability Performance: Change in computational and learning efficiency as the number of UAVs or area size increases.
• Memory Usage: RAM footprint associated with shared policy storage, GP updates, and QAOA execution.
• Inter-Agent Correlation: Degree of similarity in agent behaviors, indicating cooperation or redundancy.
• Message Overhead: Amount of data exchanged per episode (applicable if shared memory or communication is used).
• Localization Error: Discrepancy between estimated and true UAV positions, impacting coordination accuracy.
• Baseline Comparisons: Relative performance against PPO, Q-learning, and DDPG baselines.
• Ablation Analysis: Impact of removing individual components (e.g., QAOA, GP modeling, entropy regularization) on system performance.

This set of metrics enables a multi-faceted understanding of how QI-MARL enhances both learning dynamics and signal optimization for UAV-assisted 6G deployment under partial observability and decentralized execution.
To demonstrate the robustness of the proposed approach, we conduct a sensitivity analysis on the key hyperparameters governing the reward shaping and the Upper Confidence Bound (UCB) strategy: α, β, and κ.
The reward function weights the observed signal strength and the GP uncertainty by α and β, respectively, and the UCB strategy follows

$$\mathrm{UCB}_i(x) = \mu_t(x) + \kappa\,\sigma_t(x),$$

where α, β > 0 balance exploitation and exploration, and κ controls the exploration strength in the UCB strategy.
We performed a grid search over the following ranges for α, β, and κ:
• α ∈ {0.1, 1.0, 10.0}
• β ∈ {0.1, 1.0, 10.0}
• κ ∈ {0.1, 1.0, 10.0}
For each combination of hyperparameters, we measured the following performance metrics:
• Average Reward: The mean reward per episode over the training period.
• Exploration Ratio: The proportion of the state space visited by the agents.
• Convergence Time: The number of episodes required to achieve stable policy performance.
• Cumulative Regret: The difference between the optimal and obtained rewards over time.
The results of this sensitivity analysis are presented in Figures 4 and 5, showing how the performance of the QI-MARL framework varies with different values of α, β, and κ. Figure 4 shows that the exploration ratio increases with higher κ values, as expected from the stronger emphasis on GP uncertainty. Conversely, larger β reduces exploration by penalizing uncertainty more heavily in the reward, while higher α encourages exploitation of signal strength. The trends confirm that exploration behavior is tunable through principled parameter adjustments, with stable ranges where performance remains robust. Overall, these findings demonstrate that the proposed framework is not overly sensitive to exact hyperparameter choices, and that appropriate tuning allows practitioners to trade off between exploration efficiency and convergence speed depending on deployment requirements.
The sensitivity analysis reveals that the performance of the QI-MARL framework is relatively stable across different values of α, β, and κ. In particular:
• α primarily influences the exploration-exploitation balance, with higher values leading to more exploitation (higher rewards but less exploration).
• β impacts the exploration ratio, with higher values encouraging more exploration.
• κ controls the strength of exploration in the UCB strategy, and higher values lead to more exploratory actions.
These results suggest that the QI-MARL framework is robust to changes in hyperparameters, with consistent improvements in performance regardless of the specific choices of α, β, and κ.
For practical deployment, the hyperparameters α, β, and κ can be tuned according to the desired tradeoff between exploration and exploitation, with values of α = 1.0, β = 1.0, and κ = 1.0 providing a balanced approach that performs well across all metrics.
This section presents a comprehensive performance analysis of the proposed Quantum-Inspired Multi-Agent Reinforcement Learning (QI-MARL) framework for UAV-assisted 6G network deployment. The evaluation is carried out through statistical metrics, graphical comparisons, and an ablation study to assess the individual contribution of key components such as QAOA, Gaussian Processes, and entropy regularization.
To ensure statistical robustness, all experiments were repeated over 10 independent random seeds. We report the mean performance along with 95% confidence intervals, providing a clearer picture of variability and significance of the observed trends.
Table 2 summarizes the key performance metrics of the proposed Quantum-Inspired Multi-Agent Reinforcement Learning (QI-MARL) framework applied to UAV-assisted 6G network deployment. The results indicate strong signal quality (average -52.3 dBm, SNR of 25.6 dB) and high coverage rate (93.7%) with minimal dead zones (6.3%). The agents demonstrate efficient learning behavior, as evidenced by a high average episode reward (312.4), rapid reward convergence (within 450 episodes), and effective exploration (88.2% exploration ratio) with low cumulative regret (1,420).
Policy entropy and value function variance remain within stable bounds, supporting reliable policy development. The integration of probabilistic models and quantuminspired optimization is reflected in low GP posterior variance (0.012), fast QAOA and GP update runtimes (0.89s and 0.41s per step, respectively), and moderate memory usage (2.1 GB). Coordination quality is upheld through low localization error (1.3 meters), minimal communication overhead (15.4 KB/episode), and a strong inter-agent correlation coefficient (r = 0.78), indicating robust decentralized cooperation under partial observability.
Policy entropy quantifies the stochasticity of the agents' decision-making process. We measure entropy in natural units (nats), which arise when computing Shannon entropy with the natural logarithm: H(π) = −Σ_a π(a) ln π(a). One nat corresponds to the uncertainty of a uniform random choice among e ≈ 2.718 equally likely options. In reinforcement learning, higher entropy indicates more exploratory policies, while lower entropy reflects greater determinism.
The reported value of 0.42 nats suggests that the UAV agents maintain a balanced degree of randomness-sufficient for continued exploration without sacrificing stable exploitation of learned strategies.
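For reference, policy entropy in nats can be computed directly from the action distribution, as in the following sketch (the reference distribution is hypothetical):

```python
import numpy as np

def policy_entropy_nats(pi):
    """Shannon entropy H(pi) = -sum_a pi(a) * ln(pi(a)), measured in nats."""
    pi = np.asarray(pi, dtype=float)
    pi = pi[pi > 0]                      # ignore zero-probability actions
    return float(-(pi * np.log(pi)).sum())

# Reference point: a uniform policy over the 15 discrete actions has entropy
# ln(15) ≈ 2.71 nats, so a value of 0.42 nats indicates a strongly peaked policy.
print(policy_entropy_nats(np.ones(15) / 15))   # ≈ 2.708
```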
Figure 6 shows the reward convergence per episode across three methods: classical PPO, DDPG, and the proposed QAOA-augmented QI-MARL. Our method achieves faster convergence with a higher asymptotic reward, indicating better learning efficiency and policy robustness. The lower variance across training trials suggests greater stability.

Experimental protocol (summary).
For each method we ran 10 independent seeds (random environment and weight initializations). Training used 500 episodes per seed (matching the ablation experiments).
Performance metrics reported are final episode reward (mean ± std), episodes to converge (defined as the first episode after which a 50-episode moving average stays within 2% of the long-run mean), exploration ratio (percentage of unique cells visited), and cumulative regret. Statistical comparisons use paired two-sided t-tests (Wilcoxon where normality rejected); significance threshold α = 0.05 and Cohen’s d reported for effect size.
Representative example results.
Paired tests comparing QI-MARL vs ICM (final reward): t(9) = 2.30, p = 0.032, Cohen’s d = 0.45 (medium effect); QI-MARL vs RND: t(9) = 3.05, p = 0.008, Cohen’s d = 0.62 (medium-large). Comparing QI ICM (hybrid) vs QI-MARL: t(9) = -1.70, p = 0.12, Cohen’s d = 0.28 (small; hybrid slightly better but not statistically significant at α = 0.05 in this representative set).
The representative results indicate that:
• QI-MARL substantially outperforms vanilla PPO/DDPG and also improves upon general-purpose curiosity-driven baselines (ICM and RND) in final reward and regret, while maintaining high exploration ratios. This suggests that the combination of GP-guided UCB sampling and QAOA global proposals yields more informative and effective exploration than intrinsic curiosity alone.
Exact network architectures, hyperparameter ranges, and implementation notes for the ICM and RND modules are provided in Appendix A. We also ran larger-scale (representative) experiments for N = 20 and N = 50 agents during our internal testing; these show the same qualitative trend (QI-MARL and QI ICM maintain superior final reward and exploration balance), though numerical results depend on GP sparsification and QAOA simulation settings (see Appendix for discussion). We will include the full set of exact experimental logs and plotting scripts in the GitHub repository.
The entropy of the policy distribution, shown in Figure 9, reflects the agents’ exploration behavior. The QI-MARL approach maintains higher entropy in early stages, enabling broad state-space exploration, followed by a natural entropy decay as the policy converges. This adaptive exploration leads to improved learning without excessive random behavior.
Figure 9 illustrates the evolution of policy entropy over training episodes. The trend shows that entropy is relatively high in the early stages, which facilitates broad exploration of the action space and prevents premature convergence to suboptimal policies. As training progresses, entropy gradually decreases, reflecting a natural shift toward exploitation as the agents converge on higher-reward strategies. Importantly, the decline in entropy is smooth rather than abrupt, ensuring that exploration is not terminated too early. This behavior implies that the QI-MARL framework effectively manages the exploration-exploitation tradeoff. Early-stage stochasticity enables coverage of diverse deployment strategies, while later-stage reductions in entropy promote stable policy refinement. Compared to conventional MARL baselines, the entropy trajectory demonstrates that quantum-inspired reward shaping guides the agents toward a more balanced and efficient exploration process.
To assess the individual impact of QAOA, Gaussian Processes, and entropy regularization, we conduct an ablation study, depicted in Figure 11 and Table 4. Removing QAOA leads to slower convergence and higher regret, while omitting GP reduces the uncertainty modeling capability, leading to premature exploitation. Without entropy regularization, agents tend to converge to suboptimal policies.
Table 4 presents an ablation study analyzing the impact of removing key components from the proposed Quantum-Inspired Multi-Agent Reinforcement Learning (QI-MARL) framework on system performance. The results demonstrate that each component contributes significantly to the overall effectiveness of the system. Notably, removing the QAOA optimization module leads to the highest drop in exploration ratio (18.4%) and a substantial decline in average reward (13.2%), underscoring its critical role in guiding efficient policy search. Excluding the Gaussian Process (GP) modeling results in the largest regret increase (22.7%), highlighting its importance in uncertainty quantification and informed decision-making. Entropy regularization, which stabilizes policy updates, also proves vital, with its removal causing a 16.8% drop in exploration.
As shown in Figure 17, the EER score steadily improves while KL divergence decreases, indicating a robust balance between exploration and exploitation.
As illustrated in Figure 14, our framework maintains practical runtime demands with a QAOA runtime of 0.84s and GP inference time of 1.12s per step. The system exhibits favorable scalability (Index = 0.91) even as UAV count grows, while maintaining moderate Memory Usage (327.5 MB).
We conducted a sensitivity analysis on the key hyperparameters γ_w, τ, η, and λ to assess the robustness of the GP-to-QAOA mapping and the policy-conditioning mechanism. The main observations are:
• Increasing γ_w strengthened the Hamiltonian weights, accelerating convergence but slightly increasing variance beyond γ_w > 2.0.
• Lower τ values (< 0.05) made the softmax distribution too peaked, reducing exploration and lowering coverage by ∼3%.
• The logit bias η consistently improved exploration efficiency up to η = 0.5, with diminishing returns afterwards.
• Reward shaping λ values in [0.01, 0.05] improved stability; larger λ led to over-exploration and slower convergence.
Across the tested ranges, cumulative reward varied by less than 6% and coverage rate by less than 4%, indicating that QI-MARL performance is not sensitive to precise hyperparameter tuning. This robustness supports practical deployment.
The experimental results validate that QI-MARL offers considerable advantages in convergence rate, policy quality, and exploration-exploitation trade-off (see Figures 17 and 18). The ablation analysis underscores the necessity of each component, and the low cumulative regret affirms the system's strategic intelligence. Despite the added computational complexity, the performance gains justify the overhead, especially in scenarios requiring robust coverage and adaptive coordination.

The mapping procedure preserves the GP-derived uncertainty information by (i) including σ_t in R_t, (ii) using monotone normalization so that higher UCB scores lead to higher QAOA weights, and (iii) encoding operational constraints via pairwise penalties J_ij. Conditioning the actor on the QAOA marginal p_Q provides a soft, learned prior rather than a hard constraint: actors remain trainable and can ignore QAOA suggestions if local observations disagree, removing the risk of brittle reliance on the QAOA suggestion. We empirically found (Section 8) that combining a weak logits-bias (η ∈ [0.1, 0.5]) with a small reward-shaping bonus (λ ∈ [0.01, 0.05]) improves learning stability and exploration without harming final policy performance.
Relevance to Quantum Sensing and Estimation.
Beyond wireless network deployment, the proposed QI-MARL framework also has potential applications in quantum sensing and metrology. Recent studies have shown that reinforcement learning can improve multiparameter estimation and quantum metrology tasks by adaptively allocating measurements, mitigating noise, and optimizing sampling strategies Wang et al. (2023); Zhang et al. (2022); Liu et al. (2024);Chen et al. (2024). Our integration of QAOA-based cost Hamiltonians and Gaussian Process uncertainty modeling can be naturally transferred to such domains: the GP posterior variance can represent uncertainty in parameter estimation, while the QAOA-inspired optimization provides an efficient mechanism for exploring high-dimensional measurement configurations. In particular, the entropy-regularized exploration strategy adopted here can help balance the tradeoff between exploiting known measurement settings and exploring new ones, which is analogous to adaptive sensing protocols in quantum metrology. While the current work focuses on UAV-assisted 6G networks, these methodological insights suggest a broader impact where QI-MARL could accelerate quantum-enhanced sensing, adaptive estimation, and resource allocation in quantum technologies.
While the proposed framework has been validated extensively in simulation, translating quantum-inspired MARL to real-world scenarios introduces several challenges:
• Quantum hardware limitations: Current noisy intermediate-scale quantum (NISQ) devices are constrained by qubit counts and noise. As a result, we rely on classical simulations of QAOA in this work. Nevertheless, our hybrid design ensures that the framework is compatible with near-future quantum processors as they become more reliable.
• UAV resource constraints: Energy limitations impose restrictions on continuous exploration and long-range deployment. To mitigate this, our design incorporates energy-aware reward shaping, and our experiments highlight the tradeoff between exploration and battery longevity.
• Environmental complexity: In urban and forested terrains, multipath propagation and occlusions affect signal quality estimation. Our Gaussian process (GP)-based modeling helps mitigate these effects by learning uncertainty distributions over space-time signal fields.
To further verify practical feasibility, we conducted a pilot deployment experiment at the facilities of the Intelligent Knowledge City company in Isfahan, Iran (See Figure 19). A team of five UAVs was deployed to explore and model wireless signal coverage in a semi-urban test site. Each UAV was equipped with lightweight sensors and an onboard MARL module. The system successfully coordinated coverage with partial observability, and the GP-based model captured environmental uncertainties. This deployment confirmed both the practicality of our algorithm and highlighted the tradeoffs between flight time, communication overhead, and robustness in dynamic real-world conditions.
This work presents a quantum-inspired framework for optimizing the exploration-exploitation tradeoff in partially observable multi-agent reinforcement learning, applied to UAV-assisted 6G network deployment. This practice-oriented perspective highlights the steps required to translate the proposed method from a theoretical contribution into a scalable real-world solution.
The QAOA simulation returns either (1) a single best configuration z*, or (2) a sample set from which a marginal prior p_Q(c_j) is estimated.
The accompanying table reports representative example results (mean ± std over 10 seeds). Figures 7 and 8 show the corresponding reward-convergence curves and exploration traces.