General-purpose molecular dynamics simulations on GPU-based clusters

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original paper viewer below or the original arXiv source.

We present a GPU implementation of LAMMPS, a widely-used parallel molecular dynamics (MD) software package, and show 5x to 13x single node speedups versus the CPU-only version of LAMMPS. This new CUDA package for LAMMPS also enables multi-GPU simulation on hybrid heterogeneous clusters, using MPI for inter-node communication, CUDA kernels on the GPU for all methods working with particle data, and standard LAMMPS C++ code for CPU execution. Cell and neighbor list approaches are compared for best performance on GPUs, with thread-per-atom and block-per-atom neighbor list variants showing best performance at low and high neighbor counts, respectively. Computational performance results of GPU-enabled LAMMPS are presented for a variety of materials classes (e.g. biomolecules, polymers, metals, semiconductors), along with a speed comparison versus other available GPU-enabled MD software. Finally, we show strong and weak scaling performance on a CPU/GPU cluster using up to 128 dual GPU nodes.


💡 Research Summary

This paper presents a CUDA‑based implementation of LAMMPS, a widely used parallel molecular dynamics (MD) engine, and demonstrates substantial performance gains—ranging from fivefold to thirteenfold—over the CPU‑only version on a single node. The authors packaged the GPU acceleration as a separate “USER‑CUDA” module that can be built alongside the standard LAMMPS source, preserving the full feature set of the original code while requiring only a minimal change to input scripts (adding the line “accelerator cuda”). The implementation follows a hybrid architecture: inter‑node communication is handled by MPI, all particle‑centric calculations (pair forces, long‑range Coulomb via PPPM, bond/angle/dihedral/improper forces, and many LAMMPS “fix” operations) are executed on the GPU via CUDA kernels, and the remaining control logic stays in the existing C++ code running on the CPU.
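According to the summary, GPU execution is enabled by a single added line in an otherwise unchanged input script. A minimal Lennard-Jones melt script along those lines might look as follows (illustrative only; the exact enabling command and its placement may differ between LAMMPS versions, which later use the `package`/`suffix` mechanism instead):

```
accelerator  cuda

units        lj
atom_style   atomic
lattice      fcc 0.8442
region       box block 0 10 0 10 0 10
create_box   1 box
create_atoms 1 box
mass         1 1.0
velocity     all create 1.44 87287
pair_style   lj/cut 2.5
pair_coeff   1 1 1.0 1.0 2.5
fix          1 all nve
run          100
```

Everything below the first line is a standard CPU-side LAMMPS script, which is the point of the design: GPU usage stays transparent to the end user.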

Key design objectives were: (i) retain LAMMPS’s rich functionality, (ii) achieve the highest possible speed‑up, (iii) enable good parallel scalability on large GPU clusters, (iv) minimize code modifications, (v) keep the code maintainable, (vi) support the full list of LAMMPS capabilities on the GPU, and (vii) make GPU usage transparent to end‑users. To meet these goals, the authors minimized host‑GPU data transfers by keeping particle positions, velocities, and forces resident on the device for as much of the simulation as possible, and they leveraged the GPU’s massive parallelism (thousands of threads) while accounting for its relatively low memory‑bandwidth‑to‑compute ratio, high latency, and serialized random memory accesses.

Two principal strategies for short‑range force evaluation were investigated: a cell‑list approach and a neighbor‑list approach. In the cell‑list method, the simulation box is divided into a regular grid of sub‑cells; each cell is assigned to a CUDA thread block, and per‑cell atom storage is padded to a multiple of 32 so that warps remain fully occupied. Newton’s third law is exploited to halve the number of pairwise calculations, and cells are grouped into non‑interfering sets (six groups in 2‑D, eighteen in 3‑D) so that no two blocks write to the same atom concurrently. Shared memory is employed to cache particle positions of neighboring cells, reducing global memory traffic.

For the neighbor‑list method, the authors implemented two variants: Thread‑per‑Atom (TpA) and Block‑per‑Atom (BpA). TpA assigns one thread to each atom, looping over all its neighbors; BpA assigns an entire thread block to a single atom, with each thread handling a subset of its neighbor list. Both algorithms have comparable instruction counts, but BpA incurs additional reduction overhead. Crucially, BpA benefits from better cache locality because fewer atoms are active simultaneously, allowing the texture cache (or global cache on newer architectures) to be used more effectively. Empirical tests on Lennard‑Jones systems showed that TpA outperforms BpA for short cut‑off distances (e.g., 2.5 σ), whereas BpA becomes faster as the cut‑off grows (e.g., 5.0 σ). Because the crossover point depends on the specific pair potential, hardware, and neighbor count, the implementation includes a lightweight benchmark that runs at simulation startup to automatically select the faster variant.

Performance evaluation covered both single‑GPU and multi‑GPU scenarios. On a single GPU, the CUDA‑LAMMPS package supports 26 pair styles, PPPM long‑range electrostatics, NVE/NVT/NPT integrators, and a broad set of fixes, all executable in single, double, or mixed precision. Benchmarks on metals (Al), semiconductors (Si), polymers (PE), and biomolecules (proteins) with atom counts ranging from 10⁶ to 10⁸ demonstrated speed‑ups of 5–13× relative to the CPU version, with the highest gains observed for large, highly connected systems where the neighbor list is long.

For scalability, the authors ran strong‑scaling (fixed problem size) and weak‑scaling (problem size proportional to GPU count) tests on a heterogeneous cluster comprising up to 128 dual‑GPU nodes (256 GPUs total). Using MPI for inter‑node communication and overlapping GPU computation with data exchange, they achieved >80 % parallel efficiency even at the largest scale, indicating that communication overhead remained a small fraction of total runtime.

The paper also compares CUDA‑LAMMPS with other GPU‑accelerated MD codes such as HOOMD, ACEMD, and the earlier “GPU” package of LAMMPS. While those codes often focus on a limited set of force fields or lack full LAMMPS compatibility, CUDA‑LAMMPS retains the entire LAMMPS feature set, allowing users to combine multiple potentials, complex bonded interactions, and advanced fixes without code modification. In head‑to‑head benchmarks on identical hardware, CUDA‑LAMMPS matches or slightly exceeds the performance of competing packages, while offering far greater flexibility.

In summary, this work delivers a practical, high‑performance solution for running general‑purpose MD simulations on modern GPU‑based supercomputers. By preserving LAMMPS’s extensive functionality, providing automatic algorithm selection, and demonstrating strong scalability up to hundreds of GPUs, the authors make a compelling case that GPU acceleration can become the default path for large‑scale MD research. Future GPU architectures are expected to further amplify these gains, and the modular design of the USER‑CUDA package should facilitate straightforward adaptation to emerging hardware and new force‑field implementations.

