Learning Obstacle Avoidance using Double DQN for Quadcopter Navigation


One of the challenges faced by autonomous aerial vehicles is reliable navigation through urban environments. Factors such as degraded Global Positioning System (GPS) precision, narrow spaces, and dynamically moving obstacles make path planning for an aerial robot a complicated task. One of the skills the agent needs to navigate such an environment effectively is the ability to avoid collisions using information from onboard depth sensors. In this paper, we propose reinforcement learning for a virtual quadcopter agent equipped with a depth camera to navigate through a simulated urban environment.


💡 Research Summary

The paper addresses the problem of autonomous quadcopter navigation in dense urban environments where GPS accuracy degrades and dynamic obstacles such as pedestrians, vehicles, and temporary structures are common. The authors propose a reinforcement‑learning solution that relies solely on a forward‑facing depth camera, avoiding the pitfalls of RGB imagery under varying lighting conditions. The state representation consists of an 84 × 84 depth‑perspective image captured at each time step, which is fed into a convolutional neural network (CNN) that serves as the function approximator for a Double Deep Q‑Network (Double DQN).
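To make the state representation concrete, here is a minimal sketch of turning a raw depth map into the 84 × 84 input described above. The nearest-neighbour resize, the maximum sensing range, and the [0, 1] normalisation are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical preprocessing sketch: raw depth arrives as a 2-D list of
# metric distances; we downsample to 84x84 with nearest-neighbour
# sampling, clip to an assumed max range, and normalise to [0, 1].
def preprocess_depth(raw, out_h=84, out_w=84, max_range_m=20.0):
    in_h, in_w = len(raw), len(raw[0])
    state = []
    for i in range(out_h):
        src_i = i * in_h // out_h          # nearest-neighbour row index
        row = []
        for j in range(out_w):
            src_j = j * in_w // out_w      # nearest-neighbour column index
            d = min(raw[src_i][src_j], max_range_m)   # clip far returns
            row.append(d / max_range_m)    # normalise to [0, 1]
        state.append(row)
    return state
```

The normalised 84 × 84 grid is what would be stacked or fed directly into the CNN Q-network; the network architecture itself is not specified here.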

The action space is deliberately kept discrete: the vehicle moves forward at a constant speed while its yaw rate can be selected from five values (‑10°, ‑5°, 0°, 5°, 10°). This design simplifies the learning problem while still allowing the agent to steer around obstacles. Double DQN is employed to mitigate the over‑estimation bias inherent in standard DQN. Two independent Q‑networks (Q₁ and Q₂) are trained in parallel; one network selects the greedy action, the other evaluates its value, and the networks are randomly swapped during updates. An experience replay buffer stores past transitions, enabling off‑policy learning and breaking temporal correlations.
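The random-swap update over the five-way yaw action set can be sketched as follows. This is a tabular illustration of the double Q-learning rule the summary describes; the paper's agent uses CNN function approximators and an experience replay buffer instead of lookup tables, so treat the learning rate, discount factor, and table-based storage as assumptions for the example only.

```python
import random
from collections import defaultdict

YAW_RATES = (-10, -5, 0, 5, 10)   # the paper's discrete yaw-rate actions (degrees)

class DoubleQ:
    """Tabular sketch of the random-swap double Q update described above."""
    def __init__(self, alpha=0.1, gamma=0.99):
        # Two independent Q estimates, one row of action values per state.
        self.q1 = defaultdict(lambda: {a: 0.0 for a in YAW_RATES})
        self.q2 = defaultdict(lambda: {a: 0.0 for a in YAW_RATES})
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next, done):
        # Randomly choose which estimate selects the greedy action and
        # which one evaluates it -- the "random swap" during updates.
        sel, ev = (self.q1, self.q2) if random.random() < 0.5 else (self.q2, self.q1)
        if done:
            target = r
        else:
            greedy = max(sel[s_next], key=sel[s_next].get)   # selection
            target = r + self.gamma * ev[s_next][greedy]     # evaluation
        sel[s][a] += self.alpha * (target - sel[s][a])
```

Because the action that looks best to one estimate is valued by the other, a single estimate's optimistic noise is much less likely to be propagated, which is the over-estimation fix Double DQN targets.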

The reward function combines a dense component—proportional to the Euclidean distance to the goal and the lateral deviation from the straight line connecting start and goal—with sparse penalties for collisions, excessive deviation from the planned corridor, and exceeding a time‑step budget. The episode length limit is increased linearly with the episode count, encouraging early exploration while gradually allowing longer trajectories as the policy improves.
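A possible shape for this reward is sketched below. The weights, penalty magnitudes, corridor width, and episode-budget slope are all assumptions chosen for illustration; only the structure (dense distance and lateral-deviation terms, sparse terminal penalties, and a linearly growing step budget) follows the description above.

```python
import math

# Hypothetical reward sketch; coefficient values are NOT from the paper.
def reward(pos, goal, start, collided, step, episode,
           w_goal=0.05, w_lat=0.1, corridor_m=5.0,
           base_steps=200, steps_per_episode=2):
    # Episode-length budget grows linearly with the episode count.
    max_steps = base_steps + steps_per_episode * episode
    if collided:
        return -100.0, True                    # sparse collision penalty
    # Dense term: Euclidean distance to goal plus lateral deviation from
    # the straight start->goal line (point-to-line distance via cross product).
    dist = math.hypot(goal[0] - pos[0], goal[1] - pos[1])
    sx, sy = goal[0] - start[0], goal[1] - start[1]
    px, py = pos[0] - start[0], pos[1] - start[1]
    lateral = abs(sx * py - sy * px) / math.hypot(sx, sy)
    r = -(w_goal * dist + w_lat * lateral)
    if lateral > corridor_m:
        return r - 50.0, True                  # strayed from the planned corridor
    if step >= max_steps:
        return r - 10.0, True                  # exceeded the time-step budget
    return r, False
```

The negative dense term pushes the agent both toward the goal and back onto the start-goal line, while the sparse terminal penalties carve out the failure modes the summary lists.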

Experiments are conducted in Microsoft AirSim, an open‑source simulator built on Unreal Engine that provides realistic physics and sensor models. Two custom training arenas are created: (1) “Blocks,” a rectangular space populated with movable cuboid blocks that emulate simple building structures; and (2) “Wobles,” a more complex layout containing cylindrical pillars, short walls, sharp‑turn zones, and a densely packed region (Zone D) for stress‑testing. Training proceeds in stages. Initially, the agent learns a primitive collision‑avoidance policy without a goal, using only the collision‑based penalty. Once a baseline policy is established, a goal location is introduced and the agent must both avoid obstacles and minimize travel time.

Results show that during the first few hundred episodes the agent frequently collides, reflected by sharp drops in average reward. As training progresses, the average reward rises steadily, and the collision rate declines. In the Blocks arena, performance is strongly tied to the field of view of the depth camera; when obstacles approach from angles outside the central view, the policy sometimes fails to anticipate them, leading to late evasive maneuvers. In the Wobles arena, the primitive policy learned in the Blocks environment serves as a useful initialization, accelerating convergence when the agent is later exposed to the more challenging Zone D. However, the paper also notes that the discrete yaw‑rate set limits the ability to perform fine‑grained maneuvers, especially in tight corridors.

The authors acknowledge several limitations. First, all experiments are confined to simulation; real‑world deployment would introduce sensor noise, aerodynamic disturbances, and communication latency that could degrade performance. Second, the action space does not include vertical motion or variable forward speed, restricting the policy’s expressiveness for three‑dimensional avoidance. Third, the reward design focuses on distance and path deviation, omitting considerations such as energy consumption, flight time minimization, or safety margins, which are critical for practical missions.

Future work is suggested along four main directions: (i) transferring the learned policy to physical drones using domain randomization or sim‑to‑real adaptation techniques; (ii) exploring continuous‑action algorithms such as Deep Deterministic Policy Gradient (DDPG) or Soft Actor‑Critic (SAC) to enable richer control commands; (iii) incorporating multimodal perception (e.g., RGB‑Depth fusion) to improve robustness against lighting changes and occlusions; and (iv) extending the reward function to multi‑objective criteria, thereby encouraging energy‑efficient and time‑optimal trajectories. By addressing these points, the approach could evolve from a proof‑of‑concept in a simulated cityscape to a viable solution for real‑world autonomous aerial navigation.

