Quantum Reinforcement Learning with Transformers for the Capacitated Vehicle Routing Problem
This paper addresses the Capacitated Vehicle Routing Problem (CVRP) by comparing classical and quantum Reinforcement Learning (RL) approaches. An Advantage Actor-Critic (A2C) agent is implemented in classical, fully quantum, and hybrid variants, integrating transformer architectures to capture the relationships between vehicles, clients, and the depot through self- and cross-attention mechanisms. The experiments focus on multi-vehicle scenarios with capacity constraints, considering 20 clients and 4 vehicles, and are conducted over ten independent runs. Performance is assessed using routing distance, route compactness, and route overlap. The results show that all three approaches are capable of learning effective routing policies, and that the quantum-enhanced models outperform the classical baseline and produce more robust route organization, with the hybrid architecture achieving the best overall performance across distance, compactness, and route overlap. In addition to quantitative improvements, qualitative visualizations reveal that quantum-based models generate more structured and coherent routing solutions. These findings highlight the potential of hybrid quantum-classical reinforcement learning models for addressing complex combinatorial optimization problems such as the CVRP.
💡 Research Summary
This paper tackles the Capacitated Vehicle Routing Problem (CVRP), a classic NP‑hard combinatorial optimization task, by integrating reinforcement learning (RL) with quantum computing techniques. The authors develop three variants of an Advantage Actor‑Critic (A2C) architecture: a fully classical model (CPN), a hybrid quantum‑classical model (HQP), and a fully quantum model (FQP). All three agents share the same environment, state representation, and reward structure, but differ in how they process internal information.
The environment is built as an OpenAI‑Gym compatible simulator that generates random instances with 20 customers and 4 vehicles on a 2‑D map. At each decision step the state vector concatenates normalized depot information, customer coordinates and remaining demand, and vehicle positions with residual capacity, yielding a dimension of 3 + 3 N_c + 3 N_v. The action space is vehicle‑wise: each vehicle selects either a next customer to serve or a “return‑to‑depot” action. A validity mask prevents infeasible selections (e.g., serving a client whose demand exceeds remaining capacity).
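The state layout and validity mask described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the normalization constants, and the per-entity feature ordering (x, y, plus one scalar) are assumptions chosen to match the stated dimension 3 + 3 N_c + 3 N_v.

```python
import numpy as np

def build_state(depot, customers, demands, vehicles, capacities,
                map_size=1.0, max_demand=1.0, max_capacity=1.0):
    """Concatenate normalized depot, customer, and vehicle features.

    Assumed layout: depot -> (x, y, 1), each customer -> (x, y, remaining
    demand), each vehicle -> (x, y, residual capacity), yielding a vector
    of dimension 3 + 3*N_c + 3*N_v.
    """
    parts = [depot[0] / map_size, depot[1] / map_size, 1.0]
    for (x, y), d in zip(customers, demands):
        parts += [x / map_size, y / map_size, d / max_demand]
    for (x, y), c in zip(vehicles, capacities):
        parts += [x / map_size, y / map_size, c / max_capacity]
    return np.array(parts, dtype=np.float32)

def action_mask(demands, residual_capacity):
    """True where a customer is still serviceable by this vehicle;
    the final entry is the always-valid return-to-depot action."""
    serve = [(0 < d <= residual_capacity) for d in demands]
    return np.array(serve + [True])
```

With 20 customers and 4 vehicles this produces the 75-dimensional vector (3 + 60 + 12) the paper describes, and the mask zeroes out any customer whose remaining demand exceeds the vehicle's residual capacity.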
The reward function extends the standard distance‑minimization objective with three additional terms: an overlap penalty that discourages vehicles from serving nearby customers of other vehicles, a zone penalty that encourages spatial coherence within each vehicle’s service area, and a positive service reward for each successfully served customer. Formally, the incremental reward for moving from node i to node j with vehicle k is
r_inc(k,i→j) = β − α·‖p_j − p_i‖/d_max − λ_overlap·O_kj − λ_zone·Z_kj,
where O_kj and Z_kj are defined using exponential decay functions over Euclidean distances.
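A sketch of this reward in code is below. The paper only states that O_kj and Z_kj use exponential decay over Euclidean distances, so the exact forms here (nearest-other-vehicle decay for overlap, distance from the vehicle's own service centroid for the zone term) and the decay scale `sigma` are assumptions for illustration.

```python
import numpy as np

def incremental_reward(p_i, p_j, d_max, other_vehicle_positions, own_centroid,
                       alpha=1.0, beta=1.0, lam_overlap=0.5, lam_zone=0.5,
                       sigma=0.2):
    """Sketch of r_inc(k, i->j) = beta - alpha*||p_j - p_i||/d_max
    - lam_overlap*O_kj - lam_zone*Z_kj."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    step = np.linalg.norm(p_j - p_i)
    # Overlap term (assumed form): large when customer j lies close to
    # another vehicle, decaying exponentially with distance.
    o_kj = max(np.exp(-np.linalg.norm(p_j - np.asarray(q, float)) / sigma)
               for q in other_vehicle_positions)
    # Zone term (assumed form): large when j is far from vehicle k's
    # own service centroid, encouraging spatially coherent zones.
    z_kj = 1.0 - np.exp(-np.linalg.norm(p_j - np.asarray(own_centroid, float)) / sigma)
    return beta - alpha * step / d_max - lam_overlap * o_kj - lam_zone * z_kj
```

Under these assumed forms, a short move into the vehicle's own zone scores higher than a long move into another vehicle's territory, which is exactly the behavior the three penalty terms are meant to induce.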
The three A2C variants incorporate transformer‑style attention mechanisms to capture relationships among vehicles, customers, and the depot. Self‑attention encodes interactions among customers, while cross‑attention links vehicle embeddings to customer embeddings. In the hybrid model, the attention heads are parameterized by variational quantum circuits (VQCs) that replace the classical linear projections; embeddings and the critic remain classical. In the fully quantum version, embeddings, attention, and both actor and critic networks are implemented as VQCs, maximizing quantum expressivity.
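The cross-attention step linking vehicles to customers can be sketched as a single classical head. In the hybrid and fully quantum variants the projection matrices below would be replaced by variational quantum circuits; the plain-matrix version here is only meant to show the data flow, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(vehicle_emb, customer_emb, Wq, Wk, Wv):
    """Single-head cross-attention: each vehicle embedding queries the
    customer embeddings. In the hybrid model, Wq/Wk/Wv are where the
    VQC-parameterized projections would sit."""
    Q = vehicle_emb @ Wq                       # (N_v, d)
    K = customer_emb @ Wk                      # (N_c, d)
    V = customer_emb @ Wv                      # (N_c, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (N_v, N_c)
    return softmax(scores, axis=-1) @ V        # (N_v, d)
```

Each output row is a customer-context summary for one vehicle, which the actor can then use to score that vehicle's next-customer choices.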
Experiments consist of ten independent runs per model. Performance is evaluated using three metrics: total traveled distance, route compactness (a clustering‑based measure of spatial cohesion), and route overlap (degree of service region duplication across vehicles). Results show that all models learn feasible routing policies, but the quantum‑enhanced agents outperform the classical baseline. The hybrid HQP achieves the best overall performance, reducing average distance by roughly 12 % relative to CPN, improving compactness by about 9 %, and lowering overlap by 15 %. The fully quantum FQP also beats CPN but lags behind HQP, likely due to its limited circuit depth and the noise simulated in the quantum emulator. Qualitative visualizations reveal that quantum‑based policies produce more structured, zone‑coherent routes compared with the more scattered patterns of the classical model.
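The three evaluation metrics can be approximated as below. Total distance follows directly from the routes; the compactness and overlap proxies here (mean distance to the route centroid, and the fraction of cross-route customer pairs within an assumed radius) are stand-ins, since the paper's exact clustering-based definitions are not reproduced in this summary.

```python
import numpy as np

def total_distance(routes, coords):
    """Sum of Euclidean leg lengths over all routes (lists of node indices)."""
    return sum(np.linalg.norm(np.diff(coords[r], axis=0), axis=1).sum()
               for r in routes)

def route_compactness(routes, coords):
    """Assumed proxy: mean distance of each route's nodes to the route
    centroid, averaged over routes (lower = more compact)."""
    return float(np.mean([
        np.linalg.norm(coords[r] - coords[r].mean(axis=0), axis=1).mean()
        for r in routes
    ]))

def route_overlap(routes, coords, radius=0.2):
    """Assumed proxy: fraction of customer pairs from different routes
    that lie within `radius` of each other (higher = more duplication)."""
    pairs = close = 0
    for a in range(len(routes)):
        for b in range(a + 1, len(routes)):
            for i in routes[a]:
                for j in routes[b]:
                    pairs += 1
                    close += np.linalg.norm(coords[i] - coords[j]) < radius
    return close / pairs if pairs else 0.0
```

Two well-separated routes score zero overlap under this proxy, while interleaved service regions push the ratio toward one, matching the qualitative behavior the paper reports.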
The authors discuss several limitations. First, the problem size (20 customers, 4 vehicles) is modest; scaling to realistic logistics scenarios remains an open challenge. Second, the quantum circuits are shallow and simulated, so performance on actual NISQ hardware may differ. Third, hyper‑parameter selection (weights α, β, λ) heavily influences results, yet no systematic tuning method is presented. Despite these constraints, the study demonstrates that integrating transformer attention with quantum variational circuits can enhance policy expressivity and sample efficiency for dynamic, multi‑vehicle routing problems.
In conclusion, the paper introduces a novel quantum‑reinforcement‑learning framework that combines the sequence‑modeling power of transformers with the representational advantages of quantum circuits. The hybrid architecture, in particular, offers a practical trade‑off between quantum benefit and classical stability, achieving superior routing quality on the tested CVRP instances. Future work is suggested on deeper quantum circuit designs, error‑mitigation techniques, automated hyper‑parameter optimization, and extending the approach to larger, stochastic, or real‑time VRP variants.