A three-dimensional domain decomposition method for large-scale DFT electronic structure calculations

With tens-of-petaflops supercomputers already in operation and exaflops machines expected to appear within the next ten years, efficient parallel computational methods are required to take advantage of such extreme-scale machines. In this paper, we present a three-dimensional domain decomposition scheme for enabling large-scale electronic structure calculations based on density functional theory (DFT) on massively parallel computers. It is composed of two methods: (i) an atom decomposition method and (ii) a grid decomposition method. In the former, we develop a modified recursive bisection method based on the inertia tensor moment to reorder the atoms along a principal axis so that atoms that are close in real space are also close on the axis, ensuring data locality. The atoms are then divided into sub-domains, balanced among the processes, according to their projections onto the principal axis. In the latter, we define four data structures for the partitioning of grids that are carefully constructed to make data locality consistent with that of the clustered atoms, minimizing data communication between the processes. We also propose a decomposition method for solving the Poisson equation using a three-dimensional FFT in the Hartree potential calculation, which is shown to be more communication-efficient than a previously proposed parallelization method based on a two-dimensional decomposition. For evaluation, we perform benchmark calculations with our open-source DFT code, OpenMX, paying particular attention to the O(N) Krylov subspace method. The results show that our scheme exhibits good strong and weak scaling properties, with a parallel efficiency at 131,072 cores of 67.7% relative to the baseline of 16,384 cores, for a system of 131,072 diamond atoms on the K computer.


💡 Research Summary

The paper addresses the pressing need for scalable parallel algorithms that can exploit the capabilities of today’s petascale supercomputers and the forthcoming exascale machines in the context of density‑functional‑theory (DFT) electronic‑structure calculations. The authors propose a comprehensive three‑dimensional (3‑D) domain‑decomposition framework that consists of two tightly coupled components: (i) an atom‑decomposition scheme and (ii) a grid‑decomposition scheme.

In the atom‑decomposition stage the authors introduce a modified recursive bisection algorithm driven by the inertia tensor of the atomic configuration. By diagonalizing the inertia tensor they obtain a principal axis, project all atomic coordinates onto this axis, and recursively split the projected list so that each sub‑domain contains a balanced number of atoms. Because atoms that are close in real space tend to have nearby projections, the resulting one‑dimensional ordering preserves spatial locality. This ordering is then used to assign atoms to MPI processes in a way that minimizes the surface‑to‑volume ratio of each sub‑domain and ensures a roughly equal computational load even for highly irregular atomic distributions (e.g., surfaces, defects, nanostructures).
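The atom‑decomposition idea can be sketched as follows. This is a minimal illustration of the technique described above, not the authors' OpenMX implementation: it computes the inertia tensor of a point set (unit masses assumed), takes the principal axis with the smallest moment of inertia (the direction of largest spatial extent), projects the atoms onto it, and recursively bisects the ordered list into balanced groups.

```python
# Hedged sketch of inertia-tensor-based recursive bisection (illustrative only).
import numpy as np

def principal_axis(coords):
    """Principal axis of a point set via the inertia tensor (unit masses)."""
    r = coords - coords.mean(axis=0)            # center-of-mass frame
    # Inertia tensor: I = sum_i (|r_i|^2 * Id - r_i r_i^T)
    inertia = np.sum(r * r) * np.eye(3) - r.T @ r
    w, v = np.linalg.eigh(inertia)              # eigenvalues in ascending order
    # Smallest moment of inertia <-> axis along which the set is most extended
    return v[:, 0]

def recursive_bisection(indices, proj, n_groups):
    """Split atom indices into n_groups balanced groups by axis projection."""
    if n_groups == 1:
        return [list(indices)]
    order = sorted(indices, key=lambda i: proj[i])
    half = len(order) * (n_groups // 2) // n_groups   # balanced split point
    left = recursive_bisection(order[:half], proj, n_groups // 2)
    right = recursive_bisection(order[half:], proj, n_groups - n_groups // 2)
    return left + right

rng = np.random.default_rng(0)
coords = rng.random((64, 3)) * [10.0, 1.0, 1.0]       # elongated cluster
axis = principal_axis(coords)
proj = coords @ axis
groups = recursive_bisection(range(64), proj, 4)
print([len(g) for g in groups])                        # -> [16, 16, 16, 16]
```

Each group collects atoms that are contiguous along the principal axis, so atoms close in real space end up on the same process.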

The grid‑decomposition component is built around four carefully designed data structures: (1) a local block that stores the portion of the real‑space grid owned by a process, (2) a halo region that contains neighboring grid points needed for stencil operations, (3) a grid‑atom correspondence table that maps each grid point to the atom cluster that primarily influences it, and (4) a global index map that enables fast translation between local and global coordinates. These structures guarantee that the data layout of the grid mirrors the atom clustering obtained in the first stage, thereby reducing the amount of inter‑process communication required during charge‑density construction, Hamiltonian application, and potential evaluation.

A major bottleneck in large‑scale DFT is the solution of the Poisson equation for the Hartree potential. Traditional parallel implementations rely on a two‑dimensional (2‑D) decomposition of the three‑dimensional Fast Fourier Transform (FFT), which leads to an imbalance of communication volume as the number of processes grows. The authors replace this with a true 3‑D FFT decomposition. By distributing the grid evenly along all three Cartesian directions, each process handles a cubic sub‑volume of the total grid, and the transpose operations required by the FFT are performed with a minimal number of all‑to‑all communications. The paper demonstrates analytically and experimentally that the 3‑D decomposition reduces the communication cost by roughly a factor of two compared with the 2‑D scheme, especially at very high core counts.
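A toy back-of-the-envelope model (this is not the paper's analysis) makes the intuition concrete: for an N³ grid over P processes, a 2-D decomposition gives each process a long thin pencil, while a 3-D decomposition gives a roughly cubic sub-volume, whose smaller surface area serves here as a crude proxy for per-process communication:

```python
# Toy model contrasting per-process sub-volumes in 2-D (pencil) vs 3-D
# FFT decompositions; surface area is used as a rough communication proxy.
def pencil_shape(n, p):
    """2-D decomposition: each process owns an n x (n/p1) x (n/p2) pencil."""
    p1 = int(round(p ** 0.5))
    p2 = p // p1
    return (n, n // p1, n // p2)

def cube_shape(n, p):
    """3-D decomposition: each process owns a roughly cubic sub-volume."""
    q = round(p ** (1 / 3))
    return (n // q, n // q, n // q)

def surface(shape):
    a, b, c = shape
    return 2 * (a * b + b * c + c * a)

n, p = 256, 4096
print(pencil_shape(n, p), surface(pencil_shape(n, p)))  # -> (256, 4, 4) 4128
print(cube_shape(n, p), surface(cube_shape(n, p)))      # -> (16, 16, 16) 1536
```

The cubic sub-volume exposes far less surface per process at the same core count, which is the geometric intuition behind the reduced communication cost of the 3-D scheme.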

The proposed framework is implemented in the open‑source DFT package OpenMX and is tested with the O(N) Krylov‑subspace method, whose computational cost scales linearly with the number of atoms. Benchmark calculations are performed on the K computer (Japan) using up to 131,072 cores. The test system consists of 131,072 diamond‑structure carbon atoms, a problem size that would be prohibitive without an efficient domain‑decomposition strategy. Strong‑scaling results show that when the efficiency at the 16,384‑core baseline is defined as 100%, the efficiency at 131,072 cores remains at 67.7%. Weak‑scaling tests, in which the number of atoms per core is kept constant while the total number of cores is increased, exhibit near‑linear speed‑up, confirming that the algorithm maintains a balanced workload and low communication overhead as the problem size grows.
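The strong-scaling efficiency figure above follows the standard definition E(P) = (T_base · P_base) / (T(P) · P). The timings below are illustrative inventions chosen to reproduce the reported 67.7%; the paper reports only the efficiency, not these numbers:

```python
# Strong-scaling parallel efficiency relative to a baseline core count.
def efficiency(t_base, p_base, t, p):
    """E(P) = (T_base * P_base) / (T(P) * P)."""
    return (t_base * p_base) / (t * p)

# Hypothetical timings: a perfectly scaling run on 8x the cores would take
# t_base / 8; at 67.7% efficiency the measured time is longer by 1/0.677.
t_base, p_base = 100.0, 16384
t_measured = (t_base / 8) / 0.677
print(round(efficiency(t_base, p_base, t_measured, 131072), 3))  # -> 0.677
```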

The paper also discusses implementation details that are crucial for achieving the reported performance. The inertia‑tensor calculation and recursive bisection have O(N) and O(log P) complexity, respectively, where N is the number of atoms and P the number of processes. Halo exchanges are overlapped with local computations using non‑blocking MPI calls, and the 3‑D FFT is realized with the PFFT library, which provides optimized all‑to‑all transposes. The authors make the source code publicly available, facilitating reproducibility and further development by the community.

Limitations are acknowledged. The current implementation focuses on plane‑wave and pseudo‑atomic‑orbital (PAO) bases; extensions to time‑dependent DFT, GW, or hybrid functional calculations are not covered. GPU acceleration is also absent; adapting the recursive bisection and 3‑D FFT to CUDA or HIP would be a natural next step for exascale readiness.

In summary, the work delivers a robust, scalable, and publicly available domain‑decomposition methodology that tightly couples atom‑centric load balancing with grid‑centric communication minimization. By demonstrating strong and weak scaling up to 131,072 cores with a realistic DFT workload, the authors provide convincing evidence that large‑scale electronic‑structure simulations can efficiently exploit present‑day petascale systems and are well positioned for the imminent exascale era.