An Adaptative Multi-GPU based Branch-and-Bound. A Case Study: the Flow-Shop Scheduling Problem
Solving Combinatorial Optimization Problems (COPs) exactly with a Branch-and-Bound (B&B) algorithm requires a huge amount of computational resources. Therefore, we recently investigated designing B&B algorithms on top of graphics processing units (GPUs) using a parallel bounding model. The proposed model parallelizes the evaluation of the lower bounds over pools of sub-problems. The results demonstrated that the size of the evaluated pool has a significant impact on the performance of B&B and that it depends strongly on the problem instance being solved. In this paper, we design an adaptive parallel B&B algorithm for solving permutation-based combinatorial optimization problems such as the Flow-Shop Scheduling Problem (FSP) on GPU accelerators. To do so, we propose a dynamic heuristic for parameter auto-tuning at runtime. Another challenge of this work is to exploit larger degrees of parallelism by using the combined computational power of multiple GPU devices. The approach has been applied to the permutation flow-shop problem. Extensive experiments have been carried out on well-known FSP benchmarks using an Nvidia Tesla S1070 Computing System equipped with two Tesla T10 GPUs. Compared to a CPU-based execution, speed-ups of up to 105× are achieved for large problem instances.
💡 Research Summary
The paper tackles the computational challenge of solving permutation‑based combinatorial optimization problems, specifically the Flow‑Shop Scheduling Problem (FSP), by redesigning the classic Branch‑and‑Bound (B&B) algorithm for modern GPU accelerators. Traditional B&B suffers from exponential growth of the search tree, making exact solutions infeasible on CPUs for realistic instance sizes. The authors propose a “pool‑based” parallel bounding model in which a batch of sub‑problems (nodes of the B&B tree) is collected and their lower bounds are evaluated concurrently on a GPU. This approach exploits the massive data‑parallel capabilities of CUDA‑enabled devices, turning the lower‑bound computation—normally a bottleneck—into a highly parallel kernel.
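The pool-based bounding model can be illustrated with a small CPU-side sketch. The real system launches a CUDA kernel over the pool; here a plain `map` plays the kernel's role, and the `lower_bound` function is a deliberately crude illustrative bound (prefix makespan plus remaining work on the last machine), not the paper's actual lower-bound formula:

```python
def lower_bound(perm, p):
    """Toy FSP lower bound for a partial permutation `perm`.
    p[j][m] = processing time of job j on machine m."""
    n_machines = len(p[0])
    c = [0] * n_machines                      # completion time per machine
    for j in perm:                            # simulate the scheduled prefix
        c[0] += p[j][0]
        for m in range(1, n_machines):
            c[m] = max(c[m], c[m - 1]) + p[j][m]
    remaining = [j for j in range(len(p)) if j not in perm]
    # prefix completion on the last machine + work still due on that machine
    return c[-1] + sum(p[j][-1] for j in remaining)

def bound_pool(pool, p):
    """Evaluate a whole pool of sub-problems as one data-parallel map
    (stand-in for the GPU bounding kernel)."""
    return [lower_bound(node, p) for node in pool]

p = [[2, 3], [4, 1], [3, 2]]                  # 3 jobs x 2 machines
pool = [(0,), (1,), (2,)]                     # the three depth-1 sub-problems
print(bound_pool(pool, p))                    # -> [8, 10, 9]
```

The key structural point is that every node in the pool is bounded independently, which is exactly the property the GPU kernel exploits.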
A key contribution is the introduction of an adaptive heuristic that automatically tunes the size of the sub‑problem pool at runtime. The heuristic monitors several hardware‑level metrics (SM occupancy, memory bandwidth utilization, average cycles per bound computation) together with algorithmic indicators (current depth of the tree, frequency of global upper‑bound updates). By aggregating these signals into a “load score,” the system decides whether to enlarge or shrink the pool, thereby maintaining high GPU utilization while avoiding excessive memory pressure. This dynamic adjustment is crucial because the optimal pool size varies dramatically across different FSP instances and even during different phases of a single run.
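The shape of such a feedback loop can be sketched as follows. The metric weights, thresholds, and size limits below are illustrative assumptions, not the paper's exact tuning rule:

```python
def adapt_pool_size(pool_size, occupancy, bw_util,
                    min_size=256, max_size=65536):
    """Resize the sub-problem pool from hardware signals.
    occupancy and bw_util are fractions in [0, 1]; weights and
    thresholds are placeholder values for illustration."""
    load_score = 0.5 * occupancy + 0.5 * bw_util
    if load_score > 0.85:
        # Device saturated: shrink the pool to ease memory pressure.
        pool_size = max(min_size, pool_size // 2)
    elif load_score < 0.6:
        # Device underfed: enlarge the pool to raise utilization.
        pool_size = min(max_size, pool_size * 2)
    return pool_size

print(adapt_pool_size(1024, occupancy=0.40, bw_util=0.50))  # underfed  -> 2048
print(adapt_pool_size(1024, occupancy=0.95, bw_util=0.90))  # saturated -> 512
```

Running this decision between kernel launches lets the pool size track both the instance and the current phase of the search, which is what makes a single static configuration inadequate.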
To further increase parallelism, the authors extend the design to a multi‑GPU setting. Using an Nvidia Tesla S1070 system equipped with two Tesla T10 GPUs, they allocate independent work queues to each device and employ a hybrid load‑balancing scheme that combines round‑robin distribution with a lightweight predictive model of upcoming work. Inter‑GPU communication is minimized by leveraging CUDA Peer‑to‑Peer (P2P) memory access, allowing one GPU to read data produced by the other without staging through host memory. The global upper bound (GUB), which drives pruning, is kept consistent across devices through atomic compare‑and‑swap operations; each GPU updates the GUB as soon as it discovers a better feasible solution, and all devices immediately incorporate the new bound into subsequent pruning decisions.
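The invariant behind the shared global upper bound is that a candidate value is installed only if it improves on the current one. On the GPUs this is done with atomic compare-and-swap; the host-side sketch below substitutes a lock-guarded check for the atomic, purely to show the semantics:

```python
import threading

class GlobalUpperBound:
    """CAS-style 'improve only' bound shared by all workers;
    a lock stands in for the GPU atomic in this sketch."""

    def __init__(self, value=float("inf")):
        self._value = value
        self._lock = threading.Lock()

    def try_improve(self, candidate):
        """Install `candidate` only if it beats the current bound."""
        with self._lock:
            if candidate < self._value:
                self._value = candidate
                return True
            return False          # a worse solution never overwrites

    def read(self):
        return self._value

gub = GlobalUpperBound()
gub.try_improve(120)   # device 0 finds a feasible schedule of makespan 120
gub.try_improve(135)   # device 1's later, worse solution is rejected
print(gub.read())      # -> 120
```

Because updates are monotone decreasing, a stale read is always conservative: a device pruning against a slightly outdated bound never discards a node it should have kept.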
Implementation details include the use of 64‑bit arithmetic inside the bounding kernel to preserve exactness, careful structuring of the permutation data to achieve coalesced memory accesses, and the organization of CUDA streams and events to overlap computation with data transfers while avoiding unnecessary synchronizations.
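The coalescing idea can be made concrete with a layout sketch: storing the pool of permutations column-major (structure-of-arrays) means that when GPU thread t reads position k of its permutation, adjacent threads touch consecutive addresses. This transposition is a generic illustration of the technique, assuming a fixed permutation length per pool:

```python
def to_soa(pool):
    """Repack pool[t][k] into flat[k * len(pool) + t]: one 'column'
    per tree depth, so threads with consecutive t read contiguously."""
    n, depth = len(pool), len(pool[0])
    flat = [0] * (n * depth)
    for t, perm in enumerate(pool):
        for k, job in enumerate(perm):
            flat[k * n + t] = job
    return flat

pool = [[0, 1, 2],     # permutation handled by thread 0
        [2, 0, 1]]     # permutation handled by thread 1
print(to_soa(pool))    # -> [0, 2, 1, 0, 2, 1]
```

In the row-major (array-of-structures) layout, each thread's reads would instead be strided by the permutation length, splitting every warp access into multiple memory transactions.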
Experimental evaluation employs the well‑known Taillard benchmark suite, covering problem sizes from 20 × 5 up to 200 × 20 jobs and machines. The CPU baseline is a highly optimized C++ B&B implementation running on an 8‑core Xeon processor. Results show that for small instances the speed‑up is modest (≈30×) because the overhead of kernel launch dominates, but for medium and large instances the adaptive pool strategy yields average speed‑ups of 55×, with a peak of 78× for a 100 × 20 instance. When both GPUs are utilized, the largest benchmark (200 × 20) achieves a remarkable 105× acceleration compared to the CPU, roughly 1.9× faster than using a single GPU alone. Moreover, the adaptive pool mechanism consistently outperforms any static pool configuration by a factor of at least 2.3 on average.
The authors acknowledge limitations: the current design is tuned for two GPUs, and scaling to four or more devices would increase inter‑GPU traffic and synchronization costs, potentially eroding gains. Additionally, the bounding kernel is relatively lightweight for FSP; more complex lower‑bound calculations (e.g., in Traveling Salesman or Vehicle Routing problems) would require further kernel optimization and possibly different memory layouts.
In conclusion, the paper demonstrates that a combination of pool‑based parallel bounding, runtime pool‑size adaptation, and coordinated multi‑GPU execution can transform the performance envelope of exact B&B algorithms for permutation problems. The achieved accelerations—up to two orders of magnitude—make exact solutions tractable for problem sizes previously considered out of reach. Future work is outlined as follows: (1) extending the framework to clusters with many GPUs, (2) integrating automated kernel tuning (e.g., via auto‑tuning or machine‑learning‑guided parameter search), (3) generalizing the approach to non‑permutation combinatorial problems, and (4) exploring reinforcement‑learning‑based policies for more sophisticated dynamic load balancing.