A GPU-accelerated Branch-and-Bound Algorithm for the Flow-Shop Scheduling Problem


Branch-and-Bound (B&B) algorithms are time-intensive tree-based exploration methods for solving combinatorial optimization problems to optimality. In this paper, we investigate the use of GPU computing as a major complementary way to speed up these methods. The focus is put on the bounding mechanism of B&B algorithms, which is the most time-consuming part of their exploration process. We propose a parallel B&B algorithm based on a GPU-accelerated bounding model. The proposed approach concentrates on optimizing data-access management to further improve the performance of the bounding mechanism, which uses large, intermediate data sets that do not completely fit in GPU memory. Extensive experiments have been carried out on well-known FSP benchmarks using an Nvidia Tesla C2050 GPU card. We compared the obtained performance to single-threaded and multithreaded CPU-based executions. Accelerations of up to ×100 are achieved for large problem instances.


💡 Research Summary

The paper addresses the Flow‑Shop Scheduling Problem (FSP), a classic NP‑hard combinatorial optimization task that seeks to minimize the makespan of a set of jobs processed on a series of machines. While exact methods such as Branch‑and‑Bound (B&B) can guarantee optimal solutions, their practical applicability is limited by the exponential growth of the search tree. In particular, the bounding step—where a lower bound on the makespan is computed for each node—consumes the majority (often 70‑80 %) of the total execution time. The authors therefore propose to accelerate this bottleneck by off‑loading the bounding computation to a Graphics Processing Unit (GPU).

The core contribution is a GPU‑accelerated bounding model that restructures the lower‑bound calculation as a highly parallel operation. The algorithm divides the bounding work into two phases: (1) transfer of the cost matrix and the partial permutation data to the GPU, and (2) parallel evaluation of the lower bound for each candidate node using a CUDA kernel. Each kernel block processes a single node, while its threads simultaneously evaluate the contributions of individual job‑machine pairs. To mitigate the limited GPU memory, the authors introduce a streaming mechanism that partitions the full data set into manageable chunks (e.g., 64 MB). Chunks are transferred asynchronously over PCI‑Express while previous chunks are still being processed, forming a pipeline that hides transfer latency. Additionally, frequently accessed portions of the cost matrix are cached in shared memory, ensuring coalesced global memory accesses and reducing latency.
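The chunked streaming scheme can be sketched as follows. This is a CPU-side simulation under stated assumptions: the chunk size, node representation, and bound function are placeholders, and the asynchronous copy/kernel overlap (`cudaMemcpyAsync` on separate CUDA streams in the real implementation) is modeled as a sequential loop over chunks.

```python
def evaluate_bounds_chunked(nodes, bound_fn, chunk_size=4):
    """Evaluate a lower bound for every node, one chunk at a time.
    On the GPU, each chunk would be copied over PCI-Express while the
    previous chunk's kernel is still running, hiding transfer latency;
    here the pipeline stages run back to back on the CPU."""
    bounds = []
    for start in range(0, len(nodes), chunk_size):
        chunk = nodes[start:start + chunk_size]    # stand-in for the async transfer
        bounds.extend(bound_fn(n) for n in chunk)  # stand-in for the CUDA kernel
    return bounds
```

In the paper's design, one kernel block handles one node and its threads evaluate job-machine contributions in parallel; the chunk loop above corresponds to successive kernel launches over the streamed partitions.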

The host CPU retains responsibility for tree management, node selection, and pruning decisions. This separation creates a producer‑consumer pipeline: the CPU generates nodes and decides whether they should be pruned, while the GPU exclusively computes lower bounds. The design avoids the need for complex synchronization and allows both processors to operate concurrently, maximizing overall throughput.
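The producer-consumer split between host and device can be sketched with two threads and a bounded queue. This is a schematic analogue only: the consumer thread stands in for the GPU, and the queue for the host-device transfer buffer; all names are illustrative.

```python
import queue
import threading

def run_pipeline(nodes, bound_fn):
    """CPU/GPU split as a producer-consumer pipeline: one thread
    generates nodes (the host's tree management), the other drains the
    queue and computes lower bounds (the device's role)."""
    work = queue.Queue(maxsize=8)  # bounded queue throttles the producer
    results = {}

    def producer():
        for n in nodes:
            work.put(n)
        work.put(None)             # sentinel: no more nodes

    def consumer():
        while True:
            n = work.get()
            if n is None:
                break
            results[n] = bound_fn(n)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the two roles never touch each other's state beyond the queue, no fine-grained synchronization is needed, which mirrors the loose coupling the paper relies on for concurrent host and device progress.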

Experimental evaluation uses the well‑known Taillard benchmark suite, covering problem sizes from 20 to 200 jobs and 5 to 20 machines. All tests run on an Nvidia Tesla C2050 GPU (448 CUDA cores, 3 GB GDDR5) and an Intel Xeon E5‑2630 CPU (8 cores, 2.3 GHz). The GPU‑accelerated B&B is compared against a single‑threaded CPU implementation and an OpenMP‑based multi‑threaded CPU version that share the same branching order and lower‑bound function.

Results demonstrate dramatic speedups that increase with problem size. For a 50‑job, 10‑machine instance, the GPU version achieves an average 30× speedup; for a 100‑job, 10‑machine instance, the average speedup rises to 85×, with a peak of 100× on the largest instances tested. The multi‑threaded CPU version, limited by the number of cores, attains at most an 8× improvement. Even when the data set exceeds the GPU's onboard memory capacity, the streaming approach maintains near‑linear performance, confirming the effectiveness of the memory‑management strategy.

The authors acknowledge several limitations. First, the tree‑management and pruning logic remain on the CPU, introducing host‑device communication overhead that becomes noticeable for extremely deep trees. Second, the current GPU memory limits the direct handling of very large instances (e.g., 500+ jobs), suggesting that a multi‑GPU or distributed‑memory extension would be necessary for such scales. Third, the lower‑bound function itself could be refined or replaced with more sophisticated bounds that may further exploit GPU parallelism.

In conclusion, the study validates that GPU acceleration of the bounding operation can substantially reduce the runtime of exact B&B algorithms for FSP, making optimal solutions feasible for problem sizes previously considered intractable on conventional CPUs. Future work is outlined to (i) migrate additional B&B components (such as node generation) to the GPU, (ii) explore multi‑GPU and cluster‑based implementations for massive instances, and (iii) apply the same acceleration principles to other combinatorial problems like job‑shop scheduling, vehicle routing, and knapsack variants.

