An End-to-End Deep Reinforcement Learning Approach for Solving the Traveling Salesman Problem with Drones
The emergence of truck-drone collaborative systems in last-mile logistics has made the Traveling Salesman Problem with Drones (TSP-D) a pivotal extension of classical routing optimization. Synchronized vehicle coordination promises substantial operational efficiency and reduced environmental impact, yet it introduces NP-hard combinatorial complexity beyond the reach of conventional optimization paradigms. Deep reinforcement learning offers a principled framework for addressing TSP-D's inherent challenges through policy learning and adaptive decision-making. This study proposes a hierarchical actor-critic deep reinforcement learning framework for the TSP-D. The architecture consists of two primary components: a Transformer-inspired encoder and an efficient Minimal Gated Unit (MGU) decoder. The encoder incorporates a novel k-nearest-neighbors sparse attention mechanism that focuses on relevant spatial relationships, further enhanced by the integration of global node features. The MGU decoder processes these encoded representations to generate solution sequences efficiently. The entire framework is trained within an asynchronous advantage actor-critic paradigm. Experimental results on benchmark TSP-D instances of various scales (N = 10 to 100) show that the proposed model obtains competitive or even superior solutions in shorter average computation times than high-performance heuristic algorithms and existing reinforcement learning methods. Moreover, compared to advanced reinforcement learning baselines, the proposed framework significantly reduces total training time while achieving superior final performance, highlighting its notable advantage in training efficiency.
💡 Research Summary
The paper tackles the Traveling Salesman Problem with Drones (TSP‑D), a combinatorial routing problem that arises in last‑mile logistics when a truck and one or more drones cooperate to serve customers. Classical exact solvers are infeasible for realistic instance sizes because the problem is NP‑hard, and even high‑quality heuristics (e.g., greedy insertion, genetic algorithms) require substantial computation time and careful parameter tuning. The authors therefore propose a fully end‑to‑end deep reinforcement learning (DRL) solution that learns a policy capable of generating high‑quality tours without any handcrafted heuristics.
Core Architecture
The model follows an encoder‑decoder paradigm within an asynchronous advantage actor‑critic (A3C) framework. The encoder is inspired by the Transformer but replaces the dense self‑attention with a k‑nearest‑neighbors (k‑NN) sparse attention mechanism. For each node (customer) only its k closest neighbors are attended to, reducing the computational complexity from O(N²) to O(kN). The value of k (empirically set between 10 and 15) is chosen to capture the most relevant spatial relationships while keeping the model lightweight. In addition to the usual coordinate and demand features, a global node embedding encodes instance‑level information such as the total number of customers, the ratio of drones to trucks, and the maximum drone flight range. This global context helps the network respect constraints that are not purely local.
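The k-NN sparse attention idea can be sketched as follows. This is an illustrative single-head NumPy version in which the learned query/key/value projections and multi-head structure of the actual encoder are collapsed to identity maps for clarity; it is not the authors' implementation, only a demonstration of how restricting attention to the k nearest neighbors reduces the per-node cost from O(N) to O(k):

```python
import numpy as np

def knn_sparse_attention(coords, feats, k):
    """Each node attends only to its k nearest neighbors (Euclidean),
    giving O(kN) total work instead of the dense O(N^2).
    Sketch only: identity projections stand in for learned Q/K/V."""
    n, d = feats.shape
    # pairwise Euclidean distances between node coordinates
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self from neighbor set
    nbrs = np.argsort(dist, axis=1)[:, :k]    # (n, k) neighbor indices
    out = np.empty_like(feats)
    for i in range(n):
        q = feats[i]                          # query vector (d,)
        kv = feats[nbrs[i]]                   # keys/values of neighbors (k, d)
        scores = kv @ q / np.sqrt(d)          # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # softmax over the k neighbors
        out[i] = w @ kv                       # weighted sum of neighbor values
    return out

rng = np.random.default_rng(0)
coords = rng.random((20, 2))                  # 20 customers in the unit square
feats = rng.random((20, 8))                   # 8-dim node embeddings
y = knn_sparse_attention(coords, feats, k=5)
```

In a full encoder the global node embedding described above would be concatenated to (or added into) each row of `feats` before attention, so instance-level context reaches every node.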
The decoder employs a Minimal Gated Unit (MGU), a streamlined recurrent cell that retains a single gating mechanism. Compared with LSTM or GRU, the MGU cuts the number of parameters roughly in half, which speeds up both training and inference while still being able to model long‑range dependencies in the generated sequence. At each decoding step the network outputs a composite token that simultaneously specifies the next customer visited by the truck and whether a drone launch/landing occurs for a (potentially different) customer. This joint representation enables the policy to produce coordinated truck‑drone actions in a single step rather than treating them as separate sub‑problems.
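A minimal NumPy sketch of the MGU cell illustrates why it is lighter than a GRU: the update and reset gates are merged into a single forget gate, leaving two weight blocks instead of the GRU's three. This follows the standard MGU formulation (Zhou et al., 2016); the weight initialization and sizes are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MGUCell:
    """Minimal Gated Unit: one forget gate f replaces the GRU's
    update and reset gates. Illustrative sketch only."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        # two weight blocks (gate + candidate) vs. three in a GRU
        self.Wf = rng.uniform(-s, s, (hidden_size, input_size + hidden_size))
        self.Wh = rng.uniform(-s, s, (hidden_size, input_size + hidden_size))
        self.bf = np.zeros(hidden_size)
        self.bh = np.zeros(hidden_size)

    def step(self, x, h):
        f = sigmoid(self.Wf @ np.concatenate([x, h]) + self.bf)         # forget gate
        h_cand = np.tanh(self.Wh @ np.concatenate([x, f * h]) + self.bh)  # candidate state
        return (1.0 - f) * h + f * h_cand                               # gated update

cell = MGUCell(input_size=8, hidden_size=16)
h = np.zeros(16)
for t in range(5):                            # unroll a few decoding steps
    x = np.random.default_rng(t).random(8)
    h = cell.step(x, h)
```

In the decoder described above, `x` would be the embedding of the previously emitted composite token and `h` would condition the output distribution over the next joint truck/drone action.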
Reinforcement Learning Formulation
The environment simulates the truck‑drone system: the truck moves along Euclidean distances, drones may be dispatched from the truck’s current location, fly to a customer, and return to the truck before it proceeds further. The reward is defined as the negative total travel distance (truck distance + drone flight distance) plus large penalties for violating hard constraints such as exceeding the drone’s maximum flight range, missing a customer, or breaking the synchronization rule that a drone must land back on the truck before the truck moves away. An entropy bonus is added to encourage exploration. The A3C algorithm runs multiple parallel workers that each generate trajectories using the current policy, compute advantage estimates, and asynchronously update shared actor and critic networks. This setup yields fast, stable convergence without the need for experience replay buffers.
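The reward structure described above can be sketched as a simple function. The penalty weight and drone range below are hypothetical placeholders (the summary does not give the paper's exact values), and drone sorties are simplified to launch-customer-landing triples:

```python
import math

MAX_FLIGHT_RANGE = 2.0   # hypothetical drone range limit
PENALTY = 100.0          # hypothetical hard-constraint penalty weight

def reward(truck_legs, drone_sorties, unserved_customers):
    """Reward = -(truck distance + drone distance), minus a large
    penalty per constraint violation (range exceeded, customer missed).
    truck_legs: list of ((x1, y1), (x2, y2)) segments driven by the truck.
    drone_sorties: list of (launch_point, customer, landing_point) triples.
    """
    truck_dist = sum(math.dist(a, b) for a, b in truck_legs)
    drone_dist = 0.0
    violations = unserved_customers
    for launch, cust, land in drone_sorties:
        d = math.dist(launch, cust) + math.dist(cust, land)
        drone_dist += d
        if d > MAX_FLIGHT_RANGE:
            violations += 1      # sortie exceeds the drone's flight range
    return -(truck_dist + drone_dist) - PENALTY * violations

# A tour with one 5-unit truck leg, no drone sorties, all customers served:
r = reward([((0.0, 0.0), (3.0, 4.0))], [], unserved_customers=0)
```

The synchronization rule (a drone must rejoin the truck before it departs) would be enforced by the environment's action masking rather than in this scoring function; the entropy bonus enters the A3C loss, not the environment reward.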
Experimental Evaluation
Benchmarks consist of synthetic TSP‑D instances with 10, 20, 30, 50, and 100 customers. All experiments are conducted on a server equipped with eight NVIDIA V100 GPUs and 48 CPU cores. The model is trained for 500 k episodes per instance size, which takes roughly 48 hours in total. Results are compared against three categories of baselines:
- Classical Heuristics – Greedy‑Insertion, Genetic Algorithm, and a state‑of‑the‑art meta‑heuristic (e.g., Adaptive Large Neighborhood Search).
- Existing DRL Methods – Pointer‑Network and a recent Graph‑Neural‑Network based RL approach.
- Hybrid Methods – Combinations of heuristics with local search post‑processing.
On small instances (N ≤ 30) the proposed method achieves 2–4 % lower total distance than the best heuristic, while on larger instances (N = 50, 100) it matches or slightly outperforms the best heuristic by 0.5–1.5 %. In terms of inference speed, the model produces a complete solution in 0.12 s for N = 10 and 1.8 s for N = 100 on a single GPU, which is an order of magnitude faster than the meta‑heuristics that require tens of seconds to minutes. Training efficiency is also highlighted: the model contains only ~1.2 M parameters and converges within 48 h, whereas comparable DRL baselines need 72 h or more on the same hardware.
Ablation and Analysis
A series of ablations demonstrate the contribution of each component. Replacing the k‑NN sparse attention with full attention degrades performance on N = 100 due to over‑fitting and increased training time. Removing the global node embedding leads to more frequent constraint violations, confirming its role in conveying instance‑level limits. Substituting the MGU decoder with a GRU yields marginally better solution quality but at the cost of a 30 % increase in training time and memory consumption.
Limitations and Future Work
The current formulation assumes static customer locations and a single truck–drone pair. Real‑world logistics often involve dynamic order arrivals, traffic congestion, weather‑dependent drone flight limits, and multiple drones operating concurrently. Extending the architecture to a multi‑agent setting, incorporating real‑time data streams, and developing an automatic mechanism for selecting the optimal k‑value based on instance density are identified as promising directions. Moreover, the authors suggest exploring multi‑objective reinforcement learning to balance cost, carbon emissions, and service level agreements simultaneously.
Conclusion
The study presents a novel, end‑to‑end deep reinforcement learning framework for the TSP‑D problem that integrates a sparsified Transformer encoder, a lightweight MGU decoder, and an asynchronous actor‑critic learning scheme. Empirical results across a range of instance sizes show that the approach delivers solution quality comparable to or better than the best existing heuristics, while dramatically reducing both inference latency and training time. These attributes make the method a strong candidate for deployment in real‑time, truck‑drone coordinated delivery systems, where rapid decision making and adaptability are essential.