Scaling Behaviors of Evolutionary Algorithms on GPUs: When Does Parallelism Pay Off?
Evolutionary algorithms (EAs) are increasingly implemented on graphics processing units (GPUs) to leverage parallel processing capabilities for enhanced efficiency. However, existing studies largely emphasize the raw speedup obtained by porting individual algorithms from CPUs to GPUs. Consequently, these studies offer limited insight into when and why GPU parallelism fundamentally benefits EAs. To address this gap, we investigate how GPU parallelism alters the behavior of EAs beyond simple acceleration metrics. We conduct a systematic empirical study of 16 representative EAs on 30 benchmark problems. Specifically, we compare CPU and GPU executions across a wide range of problem dimensionalities and population sizes. Our results reveal that the impact of GPU acceleration is highly heterogeneous and depends strongly on algorithmic structure. We further demonstrate that conventional fixed-budget evaluation based on the number of function evaluations (FEs) is inadequate for GPU execution. In contrast, fixed-time evaluation uncovers performance characteristics that are unobservable under small or practically constrained FE budgets, particularly for adaptive and exploration-oriented algorithms. Moreover, we identify distinct scaling regimes in which GPU parallelism is beneficial, saturates, or degrades as problem dimensionality and population size increase. Crucially, we show that large populations enabled by GPUs not only improve hardware utilization but also reveal algorithm-specific convergence and diversity dynamics that are difficult to observe under CPU-constrained settings. Consequently, our findings indicate that GPU parallelism is not strictly an implementation detail, but a pivotal factor that influences how EAs should be evaluated, compared, and designed for modern computing platforms.
💡 Research Summary
This paper investigates how GPU parallelism fundamentally changes the behavior and evaluation of evolutionary algorithms (EAs) beyond mere speed‑up. Sixteen representative single‑ and multi‑objective EAs—including PSO variants, Differential Evolution (DE) and its adaptive forms, Genetic Algorithms, CMA‑ES families, and popular MOEAs such as NSGA‑II, IBEA, and MOEA/D—are benchmarked on thirty numerical and neuro‑evolution problems. Experiments are conducted on a modern NVIDIA GPU and a comparable multi‑core CPU, varying problem dimensionality from 10 to 1,000 and population sizes from 32 to 8,192. Each configuration is run thirty times to obtain reliable statistics.
The results reveal highly heterogeneous acceleration: algorithms with embarrassingly parallel fitness evaluation and vector‑friendly variation operators (e.g., GA, DE) achieve 20–40× speed‑up, while those requiring global operations such as non‑dominated sorting (NSGA‑II, IBEA) see only 3–6× gains due to reduction and synchronization bottlenecks. CMA‑ES variants benefit from GPU‑optimized matrix computations, reaching about 15× acceleration, and their performance further improves with larger populations because matrix work dominates.
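To see why GA/DE-style algorithms map so cleanly onto GPUs, it helps to note that their variation step can be written as a handful of whole-population array operations with no per-individual loop. Below is a minimal NumPy sketch of DE/rand/1/bin variation (function name and signature are ours, not from the paper); the same batched code pattern runs unchanged on GPU array libraries such as CuPy or JAX.

```python
import numpy as np

def de_rand_1_bin(pop, F=0.5, CR=0.9, rng=None):
    """Vectorized DE/rand/1/bin variation over a whole population.

    Every mutant vector and crossover mask is produced in one batched
    operation, which is what makes DE's variation embarrassingly
    parallel on GPU arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = pop.shape
    # Draw three mutually distinct partner indices per individual
    # (rows of a random permutation).
    idx = np.argsort(rng.random((n, n)), axis=1)[:, :3]
    r1, r2, r3 = pop[idx[:, 0]], pop[idx[:, 1]], pop[idx[:, 2]]
    mutant = r1 + F * (r2 - r3)
    # Binomial crossover, forcing at least one mutated coordinate per row.
    cross = rng.random((n, d)) < CR
    cross[np.arange(n), rng.integers(0, d, n)] = True
    return np.where(cross, mutant, pop)
```

By contrast, the global reductions in non-dominated sorting have no equally simple batched form, which is consistent with the smaller speed-ups reported for NSGA-II and IBEA.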
A key contribution is the critique of the traditional fixed budget of function evaluations (FEs). On a GPU, a given FE count is processed much faster than on a CPU, so small FE budgets are insufficient to expose an algorithm's exploratory capabilities. By switching to fixed wall-clock budgets (e.g., 30 min, 1 h), the authors demonstrate that large populations can explore broader regions early on, and that adaptive parameter mechanisms (SaDE, JADE) have time to self-tune, leading to higher final solution quality. This time-based assessment uncovers performance patterns that are invisible under conventional FE limits.
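The fixed-time protocol amounts to replacing a generation counter with a wall-clock check. A minimal sketch (the `step` callback and its return convention are illustrative assumptions, not the paper's API):

```python
import time
import numpy as np

def run_fixed_time(step, pop, budget_s):
    """Run an optimizer under a wall-clock budget instead of an FE count.

    `step` advances the population one generation and returns
    (new_pop, generation_best).  Returns the best value seen and the
    number of generations completed within the budget.
    """
    t0 = time.perf_counter()
    best = np.inf
    gens = 0
    while time.perf_counter() - t0 < budget_s:
        pop, gen_best = step(pop)
        best = min(best, gen_best)
        gens += 1
    return best, gens
```

Under this protocol a GPU implementation is credited for completing more generations per second, which is exactly the effect a fixed FE budget hides.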
Scaling analysis identifies three regimes. In the “gain” regime, increasing dimension or population size raises GPU core utilization above 80 % and yields steadily rising speed‑up. In the “saturation” regime, memory‑bandwidth limits and global synchronization dominate, causing speed‑up to plateau; this typically occurs for dimensions above 300 with moderate populations (256–1,024). In the “degradation” regime, overly large populations (>4,096) or very high dimensions (>1,000) trigger data‑transfer overhead and thread‑scheduling costs, resulting in slower runtimes than the CPU baseline.
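The regime boundaries can be located empirically by measuring evaluation throughput as population size grows: throughput rises in the gain regime, flattens at saturation, and drops in the degradation regime. A simple harness for that measurement might look like the following (illustrative, not the paper's benchmark code):

```python
import time
import numpy as np

def throughput(eval_fn, dim, pop_sizes, repeats=3):
    """Measure evaluations/second of a batched fitness function at
    several population sizes; a plateau or drop in the returned values
    marks the saturation and degradation regimes."""
    rng = np.random.default_rng(0)
    result = {}
    for n in pop_sizes:
        x = rng.random((n, dim))
        t0 = time.perf_counter()
        for _ in range(repeats):
            eval_fn(x)
        dt = (time.perf_counter() - t0) / repeats
        result[n] = n / dt  # individuals evaluated per second
    return result
```

Running the same harness with CPU (NumPy) and GPU (e.g., CuPy) arrays gives the per-configuration speed-up curves the scaling analysis is based on.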
Large populations enabled by GPUs also reveal algorithm‑specific convergence and diversity dynamics. For instance, IPOP‑CMA‑ES maintains similar convergence speed when the population doubles but produces a more diverse set of optima, beneficial for multimodal problems. NSGA‑II’s non‑dominated sorting cost grows with population size, yet the resulting Pareto front becomes denser, improving diversity metrics. Such phenomena are absent in CPU‑constrained experiments, highlighting population size as a critical design dimension.
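One simple way to quantify the diversity effects described above is the mean pairwise Euclidean distance within the population (the metric choice here is ours; the paper's exact diversity measures are not specified in this summary):

```python
import numpy as np

def mean_pairwise_distance(pop):
    """Population diversity as the mean pairwise Euclidean distance.

    Computed via broadcasting, so it is itself a batched operation that
    scales to the large populations GPUs make affordable.
    """
    diff = pop[:, None, :] - pop[None, :, :]      # (n, n, d) differences
    dists = np.sqrt((diff ** 2).sum(axis=-1))     # (n, n) distance matrix
    n = len(pop)
    return dists.sum() / (n * (n - 1))            # average off-diagonal entry
```

Tracking such a statistic per generation makes the convergence-versus-diversity trade-off at different population sizes directly observable.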
The authors conclude that GPU parallelism is not a peripheral implementation detail but a central factor influencing EA evaluation methodology, algorithmic design, parameter adaptation, and scalability strategies. They advocate for fixed‑time benchmarking as a standard practice and for designing GPU‑native operators that minimize global synchronization. Future work is suggested on dynamic population scaling, hybrid CPU‑GPU cooperation, and memory‑efficient kernels to further exploit GPU capabilities for evolutionary computation.