Meta Policy Switching for Secure UAV Deconfliction in Adversarial Airspace
Autonomous UAV navigation using reinforcement learning (RL) is vulnerable to adversarial attacks that manipulate sensor inputs, potentially leading to unsafe behavior and mission failure. Although robust RL methods provide partial protection, they often struggle to generalize to unseen or out-of-distribution (OOD) attacks due to their reliance on fixed perturbation settings. To address this limitation, we propose a meta-policy switching framework in which a meta-level policy dynamically selects among multiple robust policies to counter unknown adversarial shifts. At the core of this framework lies a discounted Thompson sampling (DTS) mechanism that formulates policy selection as a multi-armed bandit problem, thereby minimizing value distribution shifts via self-induced adversarial observations. We first construct a diverse ensemble of action-robust policies trained under varying perturbation intensities. The DTS-based meta-policy then adaptively selects among these policies online, optimizing resilience against self-induced, piecewise-stationary attacks. Theoretical analysis shows that the DTS mechanism minimizes expected regret, ensuring adaptive robustness to OOD attacks and exhibiting emergent antifragile behavior under uncertainty. Extensive simulations in complex 3D obstacle environments under both white-box (Projected Gradient Descent) and black-box (GPS spoofing) attacks demonstrate significantly improved navigation efficiency and higher conflict-free trajectory rates compared to standard robust and vanilla RL baselines, highlighting the practical security and dependability benefits of the proposed approach.
💡 Research Summary
The paper tackles the vulnerability of reinforcement‑learning (RL) based autonomous UAV navigation to adversarial attacks that corrupt sensor inputs, potentially causing unsafe behavior and mission failure. While existing robust RL methods provide limited protection, they typically assume a fixed perturbation model and therefore struggle to generalize to unseen or out‑of‑distribution (OOD) attacks. To overcome this limitation, the authors propose a meta‑policy switching framework that dynamically selects among a pre‑trained ensemble of robust policies, each trained under a different perturbation intensity.
The core of the framework is a Discounted Thompson Sampling (DTS) algorithm that treats policy selection as a multi‑armed bandit problem. Each arm corresponds to one robust policy, and the reward for pulling an arm is defined as the Wasserstein‑1 distance between the value‑function distribution observed after applying that policy and a reference distribution. By discounting past observations, DTS quickly adapts to sudden shifts in the environment, such as high‑intensity GPS spoofing, while still leveraging historical performance when the environment is stable.
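The discounted-bandit mechanism described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it uses standard Beta posteriors over a reward normalized to [0, 1] (e.g. one minus a scaled Wasserstein-1 value-distribution shift), and the discount factor `gamma`, the prior, and the toy reward model are all assumptions made for the sketch.

```python
import numpy as np

class DiscountedThompsonSampling:
    """Sketch of discounted Thompson sampling (DTS) over a policy
    ensemble. Each arm is one robust policy; rewards are assumed
    normalized to [0, 1]. Hyperparameters are illustrative."""

    def __init__(self, n_arms, gamma=0.95, prior=1.0):
        self.gamma = gamma                    # discount on past evidence
        self.alpha = np.full(n_arms, prior)   # Beta "success" pseudo-counts
        self.beta = np.full(n_arms, prior)    # Beta "failure" pseudo-counts

    def select(self, rng):
        # Sample one value per arm from its Beta posterior; play the best.
        samples = rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # Discount all arms so stale evidence fades (this is what lets
        # DTS track piecewise-stationary attacks), then credit the arm.
        self.alpha *= self.gamma
        self.beta *= self.gamma
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward

rng = np.random.default_rng(0)
bandit = DiscountedThompsonSampling(n_arms=4)
for t in range(100):
    arm = bandit.select(rng)
    # Hypothetical reward signal: pretend policy 2 currently induces
    # the smallest value-distribution shift under the active attack.
    reward = rng.uniform(0.6, 1.0) if arm == 2 else rng.uniform(0.0, 0.5)
    bandit.update(arm, reward)
```

Because every pseudo-count is multiplied by `gamma` at each step, the posterior's effective memory is roughly `1/(1 - gamma)` pulls, which is what allows the selector to re-explore after a sudden shift such as the onset of high-intensity GPS spoofing.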
The authors first construct a diverse set of action‑robust policies using an adversarial training scheme based on a two‑player zero‑sum Markov game. The agent and an adversary jointly train a shared critic while the agent’s policy is mixed with the adversary’s actions according to a parameter α. Varying α across a range yields policies that are robust to different levels of adversarial force. Training stops when the entropy gap between exploration and exploitation falls below a threshold, guaranteeing that each policy in the ensemble has converged to a distinct robustness profile.
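The α-mixing step can be illustrated with a small sketch. One common formulation of action-robust RL executes a convex combination of the agent's and adversary's actions; the paper's exact mixing rule may differ, and the α grid and helper names below are assumptions for illustration only.

```python
import numpy as np

def mixed_action(agent_act, adv_act, alpha):
    """Execute a convex combination of agent and adversary actions.
    At alpha = 0 the agent acts alone; larger alpha gives the
    adversary more control, yielding a more conservative policy."""
    return (1.0 - alpha) * agent_act + alpha * adv_act

# Hypothetical grid of mixing levels: one robust policy per alpha.
alphas = [0.0, 0.1, 0.2, 0.3]
agent_act = np.array([1.0, 0.0, 0.5])   # toy UAV action (e.g. velocity cmd)
adv_act = np.array([-1.0, 1.0, 0.0])    # toy worst-case adversary action

executed = [mixed_action(agent_act, adv_act, a) for a in alphas]
```

Sweeping α in this way is what produces the ensemble with distinct robustness profiles that the DTS meta-policy later switches between; each training run would additionally update agent and adversary against a shared critic until the entropy-gap stopping criterion is met.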
Theoretical analysis shows that DTS achieves an expected regret bounded by O(√(KT log T)), where K is the number of policies and T the time horizon. Moreover, the authors prove that minimizing regret also minimizes the expected shift in the value‑function distribution, which they link to an “antifragile” behavior: as adversarial intensity grows, the average performance of the meta‑policy can improve rather than degrade, provided the ensemble contains complementary policies.
Empirical evaluation is conducted in a 3‑D obstacle‑rich environment simulated with Interfered Flow Dynamics System (IFDS) to generate realistic disturbance fields. Two attack scenarios are considered: (1) a white‑box projected gradient descent (PGD) attack that perturbs the policy network’s inputs, and (2) a black‑box GPS spoofing attack that creates large observation shifts. The meta‑policy is compared against (a) a vanilla DDPG agent, (b) a robust RL agent trained on a single perturbation bound, (c) gradient‑based meta‑RL, and (d) simple ensemble averaging or UCB‑based selectors.
Results demonstrate that the DTS‑driven meta‑policy reduces average path length by 15‑20 % and raises conflict‑free trajectory rates by 10‑18 % relative to baselines. In high‑intensity GPS spoofing, where the robust RL baseline’s success rate drops below 70 %, the proposed method maintains over 92 % success. DTS also yields lower cumulative regret and fewer unnecessary switches compared with ε‑greedy or standard Thompson Sampling, confirming its suitability for real‑time UAV operation.
The paper acknowledges limitations: the simulation does not model sensor noise, communication latency, or actuator dynamics, and the computational cost of DTS scales linearly with the number of policies. Future work will incorporate realistic uncertainties, explore variational Bayesian updates to reduce overhead, and validate the approach on hardware‑in‑the‑loop testbeds and multi‑UAV cooperative deconfliction scenarios. Overall, the work introduces a novel, theoretically grounded, and empirically validated method for adaptive robustness and antifragility in adversarial UAV navigation.