More Bang for Your Buck: Improved use of GPU Nodes for GROMACS 2018

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

We identify hardware that is optimal to produce molecular dynamics trajectories on Linux compute clusters with the GROMACS 2018 simulation package. To this end, we benchmark the GROMACS performance on a diverse set of compute nodes and relate it to the costs of the nodes, which may also include their lifetime costs for energy and cooling. In agreement with our earlier investigation using GROMACS 4.6 on hardware of 2014, the performance to price ratio of consumer GPU nodes is considerably higher than that of CPU nodes. However, with GROMACS 2018, the optimal CPU to GPU processing power balance has shifted even more towards the GPU. Hence, nodes optimized for GROMACS 2018 and later versions enable a significantly higher performance to price ratio than nodes optimized for older GROMACS versions. Moreover, the shift towards GPU processing allows for cheaply upgrading old nodes with recent GPUs, yielding essentially the same performance as comparable brand-new hardware.


💡 Research Summary

The paper “More Bang for Your Buck: Improved use of GPU Nodes for GROMACS 2018” investigates which hardware configurations deliver the highest molecular‑dynamics (MD) throughput per unit cost when running the GROMACS 2018 simulation engine. The authors benchmark a wide variety of CPU‑GPU combinations on two realistic biomolecular test systems – an 80 k‑atom membrane protein (MEM) and a 2 M‑atom ribosome (RIB) – using standard MD parameters (2 fs timestep, 1 nm cut‑off, PME mesh spacing 0.12 nm). For each node they exhaustively scan MPI rank count, OpenMP thread count, and the number of dedicated PME ranks to find the optimal parallel‑execution settings.
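The per-node scan described above can be sketched as a small script that enumerates plausible splits of the node's hardware threads into MPI ranks, OpenMP threads, and dedicated PME ranks, and emits the corresponding `gmx mdrun` command lines. The thread count, `.tpr` filename, and step count below are hypothetical, and not every generated combination is valid on every node (mdrun itself rejects unsupported splits); this is a minimal illustration, not the authors' benchmarking script.

```python
HW_THREADS = 24  # hypothetical node: e.g. 12 cores / 24 hardware threads

def scan_commands(hw_threads=HW_THREADS):
    """Enumerate candidate (MPI ranks, OpenMP threads, PME ranks) splits
    and build the matching `gmx mdrun` benchmark invocations."""
    cmds = []
    for ntmpi in (1, 2, 4, 8):          # thread-MPI rank counts to try
        if hw_threads % ntmpi:
            continue
        ntomp = hw_threads // ntmpi      # OpenMP threads per rank
        for npme in (0, 1):              # 0 = no separate PME rank
            if npme >= ntmpi:            # at least one rank must do PP work
                continue
            cmds.append(
                f"gmx mdrun -s bench.tpr -ntmpi {ntmpi} -ntomp {ntomp} "
                f"-npme {npme} -nb gpu -pme gpu -nsteps 10000 -resethway"
            )
    return cmds

for cmd in scan_commands():
    print(cmd)
```

In a real scan one would execute each command, parse the ns/day figure that mdrun reports, and keep the fastest setting; `-resethway` restarts the performance counters halfway through so the timing excludes load-balancing warm-up.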

Key hardware tested includes legacy CPUs (e.g., Intel Xeon E5‑2680 v2) and recent Xeon Scalable processors, as well as a spectrum of GPUs ranging from older consumer cards (GTX 680, GTX 980) to modern Pascal and Turing GPUs (GTX 1080, RTX 2070/2080/2080 Ti) and professional accelerators (Tesla K40c, V100, Quadro series). Prices, power draw, and thermal design power (TDP) are taken into account, and total cost of ownership (TCO) is calculated by adding the purchase price to the projected electricity and cooling expenses over the node's lifetime.
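The TCO calculation described above can be sketched as follows. The lifetime, electricity price, and cooling overhead used here are illustrative assumptions, not figures taken from the paper:

```python
def total_cost_of_ownership(price_eur, power_w, years=5.0,
                            eur_per_kwh=0.20, cooling_factor=2.0):
    """Purchase price plus projected lifetime energy and cooling costs.

    Assumptions (hypothetical, for illustration): a 5-year lifetime,
    0.20 EUR/kWh, and a cooling_factor of 2.0, i.e. cooling roughly
    doubles the raw electricity bill.
    """
    kwh = power_w / 1000.0 * 24 * 365 * years   # lifetime energy use
    return price_eur + kwh * eur_per_kwh * cooling_factor

def perf_per_euro(ns_per_day, tco_eur):
    """Performance-to-price ratio: simulation throughput per euro of TCO."""
    return ns_per_day / tco_eur
```

For example, a hypothetical 3000 EUR node drawing 400 W accumulates about 7000 EUR in energy and cooling costs under these assumptions, so operating expenses can dominate the purchase price and must not be left out of the comparison.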

The study finds that consumer‑grade GPUs dramatically outperform both professional GPUs and CPU‑only nodes in the performance‑to‑price (P/P) metric. The modern Turing‑based RTX 2080 Ti, despite its lower double‑precision capability, delivers a 2–3× higher P/P ratio than a Tesla V100 when used for GROMACS 2018, largely because GROMACS relies on single‑precision, compute‑bound kernels (non‑bonded short‑range forces and PME). The authors show that GPU architectural improvements between 2014 and 2018 (12 nm vs. 28 nm processes, higher memory bandwidth, more CUDA cores) translate into 4–6× speed‑ups for the off‑loaded kernels, outpacing the modest gains seen in CPUs despite the introduction of AVX‑512.

A crucial insight is the shift in the optimal CPU‑to‑GPU balance. Because GROMACS 2018 can offload the PME mesh computation to the GPU in addition to the short‑range non‑bonded forces, far fewer CPU cores per GPU are needed than in earlier versions. Adding a second GPU to a node that already has one yields diminishing returns unless the CPU side is also scaled, because the CPU becomes the bottleneck for tasks such as integration, constraints, and data transfer. Consequently, the P/P ratio peaks when each node pairs a modest number of strong CPU cores (often 8–12) with a single high‑end consumer GPU.

Cost analysis reveals that, for a fixed budget, a cluster built from GPU‑centric nodes can generate 2–3× more nanoseconds per day of simulation than a comparable CPU‑only or professional‑GPU cluster. When electricity and cooling are included, the advantage persists because modern GPUs have a favorable performance‑per‑watt ratio. The authors also demonstrate that retrofitting older servers (e.g., 2014‑era Xeon boxes) with a current RTX 2080 yields almost the same throughput as purchasing a brand‑new GPU‑optimized node, dramatically shortening the pay‑back period.
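The retrofit economics above can be illustrated by comparing the cost of one ns/day of sustained throughput for a GPU upgrade versus a brand‑new node. All numbers below are hypothetical placeholders, not measurements from the paper:

```python
def eur_per_ns_day(cost_eur, ns_per_day):
    """Euros spent per ns/day of sustained simulation throughput."""
    return cost_eur / ns_per_day

# Hypothetical scenario: retrofitting an already-paid-for 2014-era server
# with a new consumer GPU versus buying a complete new GPU node.
retrofit = eur_per_ns_day(800.0, 55.0)    # GPU-only upgrade cost
new_node = eur_per_ns_day(6000.0, 60.0)   # full new node, similar throughput
```

Because the retrofit only pays for the GPU while delivering nearly the same throughput, its cost per ns/day is far lower, which is what shortens the pay‑back period.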

The paper stresses practical constraints such as rack density (how many GPUs can be packed per rack unit) and interconnect requirements. Since the target workload is many independent or weakly coupled simulations rather than a single large‑scale parallel run, high‑speed interconnects (InfiniBand) are not essential; a fleet of many modest nodes with fast local GPUs is more cost‑effective.

Finally, the authors provide a set of actionable recommendations for MD practitioners: (1) prioritize the latest consumer GPUs (RTX 2070/2080 series) over professional cards; (2) pair each GPU with only as many strong CPU cores as it needs rather than over‑provisioning the CPU side; (3) consider upgrading existing servers with new GPUs before buying new hardware; (4) factor in power and cooling costs when estimating TCO; and (5) design racks to maximize GPU density while keeping networking modest. The study concludes that, with GROMACS 2018’s expanded GPU off‑loading capabilities, GPU‑centric hardware is unequivocally the most bang‑for‑your‑buck solution for high‑throughput biomolecular simulations.

