UAV Trajectory Optimization via Improved Noisy Deep Q-Network
This paper proposes an Improved Noisy Deep Q-Network (Improved Noisy DQN) to enhance the exploration ability and training stability of Unmanned Aerial Vehicle (UAV) trajectory planning with deep reinforcement learning in simulated environments. The method strengthens exploration by combining residual NoisyLinear layers with an adaptive noise scheduling mechanism, while improving training stability through a smooth loss and soft target network updates. Experiments show that the proposed model converges faster and achieves up to 40% higher rewards than standard DQN, and quickly converges to near the minimum number of steps (28) required for the task in the 15 × 15 grid navigation environment. The results show that our comprehensive improvements to the NoisyNet network structure, exploration control, and training stability enhance the efficiency and reliability of deep Q-learning.
💡 Research Summary
The paper addresses the problem of autonomous trajectory planning for unmanned aerial vehicles (UAVs) operating under communication constraints. The authors propose an “Improved Noisy Deep Q‑Network” (Improved Noisy DQN) that augments the standard NoisyNet architecture with residual NoisyLinear blocks, an adaptive noise‑scaling schedule, and training‑stability enhancements such as Double DQN target estimation and soft target updates.
Problem formulation
A 15 × 15 grid world is used to emulate a realistic environment. The UAV starts at (0,0) and must reach the goal at (14,14) while avoiding a mixture of structured and random obstacles. Signal strength from a base station decays with the square of the Euclidean distance and is further attenuated by obstacles; the UAV must keep the signal above a minimum threshold. The authors formalize this as a multi‑objective discrete optimization problem with constraints on collision avoidance, maximum step count, minimum distance to obstacles, and minimum signal strength.
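The distance-squared decay with per-obstacle attenuation can be sketched as follows. The transmit power `p0`, the attenuation factor `atten`, and the line-of-sight blocking test are illustrative assumptions, not values taken from the paper:

```python
import math

def _dist_point_segment(p, a, b):
    # Euclidean distance from point p to the segment a-b.
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def signal_strength(uav, base, obstacles, p0=1.0, atten=0.5):
    """Illustrative inverse-square path-loss model with obstacle attenuation."""
    d2 = max((uav[0] - base[0]) ** 2 + (uav[1] - base[1]) ** 2, 1.0)
    strength = p0 / d2  # decays with the square of the Euclidean distance
    for obs in obstacles:
        if _dist_point_segment(obs, uav, base) < 0.5:  # obstacle blocks line of sight
            strength *= atten                          # extra attenuation per obstacle
    return strength
```

The UAV's constraint is then simply `signal_strength(...) >= threshold` at every step.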
Reinforcement‑learning setup
The state vector includes the UAV’s current coordinates, the positions of all obstacles, and the instantaneous signal strength. The action space consists of five discrete moves (up, right, down, left, hover). A composite reward function combines seven weighted terms: distance to goal, step‑wise movement cost, remaining time, proximity to obstacles, a bias toward the goal, a penalty for violating the minimum obstacle distance, and a reward for maintaining sufficient signal strength. The weights sum to one and are tuned experimentally. Episodes terminate upon reaching the goal, exceeding the step limit, or colliding with an obstacle.
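The seven-term weighted reward can be sketched as a single weighted sum. The individual term definitions and the weight vector `w` are placeholders chosen only to sum to one; the paper tunes the actual weights experimentally:

```python
def composite_reward(d_goal, step_cost, t_left, d_obs, goal_bias,
                     violated, signal_ok,
                     w=(0.3, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1)):
    """Illustrative weighted sum of the seven reward terms described above."""
    terms = (
        -d_goal,                      # distance to goal (closer is better)
        -step_cost,                   # per-step movement cost
        t_left,                       # reward for remaining time budget
        d_obs,                        # reward for keeping clear of obstacles
        goal_bias,                    # directional bias toward the goal
        -1.0 if violated else 0.0,    # minimum-obstacle-distance violation penalty
        1.0 if signal_ok else 0.0,    # signal strength above threshold
    )
    assert abs(sum(w) - 1.0) < 1e-9   # weights sum to one
    return sum(wi * ti for wi, ti in zip(w, terms))
```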
Network architecture
The Q‑network comprises a two‑layer multilayer perceptron (MLP) feature extractor followed by two residual NoisyLinear layers and a final value head. Each NoisyLinear layer replaces the deterministic linear transformation with a stochastic one:
$y = (\mu_w + \alpha \sigma_w \odot \epsilon_w)\,x + (\mu_b + \alpha \sigma_b \odot \epsilon_b)$
where $\mu$ and $\sigma$ are learnable means and standard deviations, $\epsilon$ is sampled from a standard normal distribution, and $\alpha$ is a global noise-scale factor. Factorized Gaussian noise is employed to reduce computational overhead. The residual connections ensure that the injected noise propagates effectively through depth, mitigating the vanishing-exploration problem observed in deeper NoisyNet variants.
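A minimal numpy sketch of one forward pass through such a layer, assuming square weights (so the residual add is well defined) and a ReLU nonlinearity; the initialization scheme and activation are assumptions, not details from the paper:

```python
import numpy as np

def f(x):
    """Factorized-noise transform f(x) = sign(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

def residual_noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, alpha=1.0, rng=None):
    """One forward pass of a residual NoisyLinear layer (illustrative sketch).

    Weights and biases are perturbed by factorized Gaussian noise scaled by the
    global factor `alpha`; the input is added back (residual connection) so the
    exploration noise survives through depth.
    """
    rng = rng or np.random.default_rng()
    n = x.shape[-1]
    eps_in, eps_out = rng.standard_normal(n), rng.standard_normal(n)
    eps_w = np.outer(f(eps_out), f(eps_in))        # factorized weight noise
    w = mu_w + alpha * sigma_w * eps_w
    b = mu_b + alpha * sigma_b * f(eps_out)
    return x + np.maximum(w @ x + b, 0.0)          # residual + ReLU
```

Setting `alpha=0` recovers a deterministic residual linear layer, which is what the schedule below drives the network toward as training succeeds.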
Adaptive noise scheduling
Instead of a fixed decay, the authors introduce a performance‑aware schedule:
$\alpha(n) = \alpha_{\min} + (\alpha_{\max} - \alpha_{\min}) \cdot (1 - n/T_{\text{decay}}) \cdot g(P_n)$
where $P_n$ is the recent success rate and $g(\cdot)$ adjusts the decay speed based on whether the success rate is low, high, or moderate. Additionally, noise is re-sampled every $k$ steps to encourage policy diversity.
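The schedule can be sketched as below. The gating function $g$ is an assumption consistent with the description above: it slows the decay (keeps more noise) while the success rate is low and speeds it up once success is high; the thresholds `lo`/`hi` and all default hyper-parameters are illustrative:

```python
def noise_scale(n, success_rate, alpha_min=0.1, alpha_max=1.0,
                T_decay=1000, lo=0.3, hi=0.8):
    """Performance-aware noise schedule alpha(n) (illustrative sketch)."""
    if success_rate < lo:
        g = 1.5    # struggling: decay noise slowly, keep exploring
    elif success_rate > hi:
        g = 0.5    # succeeding: decay noise faster, exploit
    else:
        g = 1.0    # moderate: nominal decay
    frac = max(0.0, 1.0 - n / T_decay)
    return min(alpha_max, alpha_min + (alpha_max - alpha_min) * frac * g)
```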
Stability mechanisms
To curb overestimation bias, Double DQN is used: the action that maximizes the current Q-network is selected, but the target Q-value is obtained from a separate target network. The target network is updated softly with a mixing factor $\tau \ll 1$, which smooths the learning target and reduces oscillations. The loss is the mean-squared error (or Huber loss) between the predicted Q-value and the Double-DQN target.
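The two mechanisms can be sketched in a few lines; the default $\gamma$ and $\tau$ values are illustrative, not taken from the paper:

```python
import numpy as np

def double_dqn_target(r, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return r
    a_star = int(np.argmax(next_q_online))       # action selection (online net)
    return r + gamma * next_q_target[a_star]     # action evaluation (target net)

def soft_update(target_params, online_params, tau=0.005):
    """Soft (Polyak) target update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online_params, target_params)]
```

Because `a_star` comes from the online network while its value comes from the target network, the max-operator overestimation of vanilla DQN is decoupled, and the slow soft update keeps the regression target from shifting abruptly.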
Experimental evaluation
The authors compare Improved Noisy DQN against standard DQN, Double DQN, DQN with Prioritized Experience Replay (PER), and the original Noisy DQN. All methods are trained for the same number of episodes in the 15 × 15 grid with identical obstacle layouts and signal‑attenuation parameters. Results show:
- Faster convergence – the proposed method reaches a stable loss within the first 10 % of episodes, whereas baselines require 2–3 times more episodes.
- Higher cumulative reward – up to a 40 % increase over the original Noisy DQN and even larger gains versus non‑noisy baselines.
- Shorter trajectories – the average number of steps to reach the goal converges to 28.3, close to the theoretical minimum, while baselines hover around 35–42 steps.
- Better compliance with communication constraints – the signal‑strength constraint is satisfied in >95 % of episodes, outperforming all baselines.
Contributions and limitations
The paper’s primary contributions are: (1) a novel residual NoisyLinear architecture that preserves exploratory noise across depth, (2) an adaptive, performance‑driven noise‑scaling schedule that balances exploration and exploitation, and (3) the integration of Double DQN and soft target updates to improve training stability in a multi‑objective UAV navigation task.
However, the study is limited to a 2-D grid simulation; real-world UAV flight involves 3-D dynamics, wind disturbances, and more complex radio propagation models. The adaptive noise schedule introduces several hyper-parameters ($\alpha_{\min}$, $\alpha_{\max}$, $T_{\text{decay}}$, $k$) whose sensitivity is not fully explored, raising concerns about generalizability to other domains. Future work should extend the approach to multi-UAV coordination, incorporate realistic channel models, and validate the algorithm on physical drone platforms.
Overall, the Improved Noisy DQN presents a compelling blend of architectural innovation and adaptive exploration control, demonstrating measurable gains in a challenging, communication‑aware navigation scenario.