A GPU implementation of the Simulated Annealing Heuristic for the Quadratic Assignment Problem
The quadratic assignment problem (QAP) is one of the most difficult combinatorial optimization problems. An effective heuristic for obtaining approximate solutions to the QAP is simulated annealing (SA). Here we describe an SA implementation for the QAP that runs on a graphics processing unit (GPU). GPUs are composed of low-cost commodity graphics chips which in combination provide a powerful platform for general-purpose parallel computing. For SA runs with large numbers of iterations, we find performance 50-100 times better than that of a recent non-parallel but very efficient implementation of SA for the QAP.
💡 Research Summary
The paper presents a GPU‑accelerated implementation of the Simulated Annealing (SA) heuristic for the Quadratic Assignment Problem (QAP), one of the most challenging combinatorial optimization problems. The authors begin by outlining the mathematical formulation of QAP, where the goal is to assign n facilities to n locations so that the sum of products between a flow matrix (representing interaction between facilities) and a distance matrix (representing distances between locations) is minimized. Because QAP is NP‑hard, exact algorithms are impractical for all but the smallest instances, and meta‑heuristics such as SA are widely used to obtain high‑quality approximate solutions.
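In symbols, with flow matrix $F = (f_{ij})$, distance matrix $D = (d_{kl})$, and a permutation $\pi$ assigning facility $i$ to location $\pi(i)$, the objective described above is the standard Koopmans–Beckmann form of the QAP:

```latex
\min_{\pi \in S_n} \; \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} \, d_{\pi(i)\pi(j)}
```

Each term charges the flow between facilities $i$ and $j$ at the distance between their assigned locations, so high-flow pairs are pushed toward nearby locations.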
Traditional SA implementations run on CPUs and, while highly optimized, become a bottleneck when the number of annealing iterations reaches millions or billions. The authors therefore exploit the massive data‑parallel capabilities of modern graphics processing units (GPUs). They analyze the core SA operations—proposal of a swap, computation of the cost change Δ, and Metropolis acceptance test—and map each operation onto the GPU’s SIMD architecture. A key insight is that the Δ for swapping facilities i and j can be expressed as a sum of a small number of terms involving rows and columns of the flow and distance matrices. This enables each GPU thread to compute its own Δ independently, allowing full parallelization of the most expensive part of the algorithm.
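The O(n) cost-change computation described above can be sketched in plain Python (the paper's actual kernel is GPU code; `swap_delta` and its position-indexed permutation convention are illustrative assumptions, not the authors' API):

```python
def swap_delta(F, D, p, r, s):
    """Change in QAP cost when the facilities at positions r and s of
    permutation p are exchanged, computed in O(n) from the rows and
    columns of the flow matrix F and distance matrix D.

    p[i] is the location assigned to facility i; a and b are the
    locations currently held by the two facilities being swapped.
    """
    a, b = p[r], p[s]
    # Diagonal and cross terms involving only r and s themselves.
    delta = (F[r][r] - F[s][s]) * (D[b][b] - D[a][a]) \
          + (F[r][s] - F[s][r]) * (D[b][a] - D[a][b])
    # One pass over the remaining facilities: row terms plus column terms.
    for k in range(len(p)):
        if k == r or k == s:
            continue
        pk = p[k]
        delta += (F[r][k] - F[s][k]) * (D[b][pk] - D[a][pk]) \
               + (F[k][r] - F[k][s]) * (D[pk][b] - D[pk][a])
    return delta
```

Because each candidate swap's delta reads only fixed rows and columns of F and D, independent threads can evaluate many swaps in parallel without touching each other's data, which is exactly the property the GPU mapping exploits.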
Memory layout is carefully engineered to avoid bandwidth bottlenecks. The flow and distance matrices, which are read‑only during annealing, are copied into constant and texture memory to benefit from caching. The current assignment permutation is stored in shared memory within each thread block, while the global permutation is kept in global memory and updated atomically only when a swap is accepted. Random numbers required for swap proposals and acceptance tests are generated on‑the‑fly using a lightweight per‑thread XOR‑Shift generator, eliminating the need for large pre‑computed random tables.
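A per-thread xorshift generator of the kind mentioned above needs only one 32-bit word of state per thread. The sketch below uses Marsaglia's common 13/17/5 shift constants; the constants and the seeding-by-thread-id convention are assumptions for illustration, not details taken from the paper:

```python
def xorshift32(state):
    """One step of a 32-bit xorshift PRNG: three shift-and-XOR
    operations, masked to 32 bits. Returns the next state, which
    doubles as the random output."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state & 0xFFFFFFFF

def thread_rng_stream(thread_id, count):
    """Illustrative per-thread stream: seed from the (hypothetical)
    thread id, offset so no thread starts from the forbidden state 0."""
    state = thread_id + 1
    out = []
    for _ in range(count):
        state = xorshift32(state)
        out.append(state)
    return out
```

The appeal on a GPU is that each thread advances its own state with a handful of bitwise instructions and no memory traffic, which is why such generators replace large precomputed random tables.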
The algorithm proceeds as follows: (1) an initial permutation is generated on the host and transferred to the device; (2) a kernel launch creates a large number of candidate swaps, each handled by a separate thread; (3) each thread computes its Δ using the pre‑loaded matrices and shared‑memory permutation; (4) a block‑level reduction aggregates Δ values, and an atomic decision determines whether the swap satisfies the Metropolis criterion at the current temperature; (5) if accepted, the global permutation is updated atomically. Temperature scheduling (cooling schedule) is managed on the CPU, with the entire kernel re‑executed for each temperature level.
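The Metropolis test in step (4) and the host-side cooling loop can be sketched as follows (the function name, the injectable `rng` parameter, and the geometric cooling factor are illustrative assumptions; the paper runs this logic across kernel launches rather than in a Python loop):

```python
import math
import random

def metropolis_accept(delta, T, rng=random.random):
    """Metropolis criterion: always accept an improving swap
    (delta <= 0); accept a worsening one with probability
    exp(-delta / T), so worse moves pass more often at high T."""
    return delta <= 0 or rng() < math.exp(-delta / T)

def anneal_temperatures(T0, alpha, levels):
    """Host-side geometric cooling schedule: the kernel would be
    re-launched once per temperature in this sequence."""
    T = T0
    for _ in range(levels):
        yield T
        T *= alpha
```

At each temperature level the GPU kernel evaluates its batch of candidate swaps, and only the scalar temperature update crosses the host-device boundary, which keeps the synchronization overhead per level small.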
Experimental evaluation uses both synthetic random instances (sizes up to 2000 × 2000) and standard QAPLIB benchmarks (e.g., tai150b, nug30). The GPU implementation is compared against a state‑of‑the‑art non‑parallel SA code (referred to as SA‑CPU) that employs many of the same algorithmic tricks but runs on a single core. For large‑scale problems (n ≥ 1000) and high iteration counts (10⁶–10⁸), the GPU version achieves speed‑ups ranging from 50× to 100×, with an average of 62×. For smaller problems (n < 200) the speed‑up diminishes to 5–10× because the GPU’s parallel resources are under‑utilized. Importantly, solution quality is statistically indistinguishable from SA‑CPU; the GPU version reaches comparable objective values in a fraction of the time, and its convergence curve is steeper.
The authors discuss several limitations. First, the cooling schedule is still performed on the host, introducing a modest synchronization overhead. Second, the implementation assumes that the entire flow and distance matrices fit into the GPU’s constant/texture memory; extremely large instances would require tiling or multi‑GPU strategies. Third, the speed‑up plateaus for very small problem sizes due to insufficient thread parallelism. Future work includes moving the temperature update logic onto the device, exploring asynchronous streams to overlap data transfers with computation, and extending the framework to multi‑GPU or hybrid CPU‑GPU pipelines. The paper also suggests integrating other meta‑heuristics (Tabu Search, Ant Colony) with the GPU‑accelerated SA kernel to form hybrid solvers.
In conclusion, the study demonstrates that a carefully engineered GPU implementation of Simulated Annealing can dramatically accelerate the solution of the Quadratic Assignment Problem without sacrificing solution quality. By parallelizing the Δ‑computation, optimizing memory accesses, and leveraging fast per‑thread random number generation, the authors achieve up to two orders of magnitude speed‑up over a highly tuned CPU baseline. The source code is slated for open‑source release, inviting further research and practical adoption in domains such as facility layout design, VLSI placement, and data‑center resource allocation.