Accelerating cellular automata simulations using AVX and CUDA


We investigated various methods of parallelization of the Frisch-Hasslacher-Pomeau (FHP) cellular automaton algorithm for modeling fluid flow. These methods include SSE, AVX, and POSIX Threads for central processing units (CPUs) and CUDA for graphics processing units (GPUs). We present implementation details of the FHP algorithm based on AVX/SSE and CUDA technologies. We found that (a) using AVX or SSE is necessary to fully utilize the potential of modern CPUs; (b) CPUs and GPUs are comparable in terms of computational and economic efficiency only if the CPU code uses AVX or SSE instructions; (c) AVX does not offer any substantial improvement over SSE.


💡 Research Summary

The paper presents a systematic performance study of the Frisch‑Hasslacher‑Pomeau (FHP) cellular automaton, a lattice‑based algorithm used to simulate fluid flow, across several parallel computing paradigms. Three implementation families are examined: (1) SIMD vectorization on CPUs using SSE (128‑bit) and AVX (256‑bit) instruction sets, (2) multi‑core parallelism on CPUs using POSIX threads, and (3) massive parallelism on NVIDIA GPUs using CUDA.

The FHP algorithm updates each lattice node by shifting particles along six possible directions and applying a collision rule that depends only on the local bit pattern. The authors pre‑compute the collision outcomes in a lookup table, then implement the per‑node update as a sequence of bit‑mask, shift, and logical‑or operations. This structure makes the algorithm amenable to data‑parallel execution but also exposes memory bandwidth as a potential bottleneck.

For the SIMD part, the authors load 8‑bit state vectors into __m128i (SSE) or __m256i (AVX) registers, apply a series of intrinsics (_mm_and_si128, _mm_or_si128, _mm_slli_epi16, etc.) to compute the new state, and store the result back to memory. Benchmarks on a modern Intel Xeon processor show that AVX yields only a modest 5 % speed‑up over SSE on the same core, indicating that the larger register width does not translate into proportional performance gains because the memory subsystem limits throughput.

Thread‑level parallelism is achieved by partitioning the lattice rows among a configurable number of POSIX threads. Each thread processes its assigned rows using the SIMD kernels described above, and a barrier synchronizes all threads at the end of each time step. Scaling results on an 8‑core (16‑thread) system demonstrate near‑linear speed‑up, reaching up to a 12× overall acceleration compared with the sequential baseline. The authors attribute this efficiency to the low inter‑thread communication overhead and the ability of the CPU cache hierarchy to serve the streaming memory accesses required by the algorithm.

The CUDA implementation maps each lattice cell to a single GPU thread. A block of 256 threads loads the collision lookup table from global memory into shared memory, synchronizes with __syncthreads(), and then each thread performs the same bit‑wise update as in the CPU versions. Data layout is transformed to a structure‑of‑arrays format to achieve coalesced memory accesses. On a mid‑range NVIDIA GPU, the CUDA version outperforms the AVX/SSE‑enabled 8‑core CPU by roughly 1.2× for a 1024 × 1024 lattice, while consuming comparable power and hardware cost. Consequently, the authors argue that CPUs equipped with SIMD and multi‑core capabilities can be as cost‑effective as GPUs for this class of problems.

From these experiments the paper draws three key conclusions: (a) SIMD vectorization is essential to exploit modern CPU capabilities; however, AVX does not provide a dramatic advantage over SSE because memory bandwidth, not arithmetic throughput, dominates performance. (b) When SIMD is combined with multi‑core threading, CPU implementations can match or even exceed GPU performance in terms of raw speed‑up per dollar, especially for workloads that fit comfortably within the CPU cache hierarchy. (c) CUDA delivers higher absolute throughput but incurs higher development complexity and hardware dependence, and its economic advantage is limited when the CPU code is properly optimized. The authors suggest that these findings generalize to other lattice‑based simulations such as Lattice Boltzmann methods or Ising models, where similar patterns of local updates and memory‑bound computation prevail.