A novel scalable high performance diffusion solver for multiscale cell simulations


Agent-based cellular models simulate tissue evolution by capturing the behavior of individual cells, their interactions with neighboring cells, and their responses to the surrounding microenvironment. An important challenge in the field is scaling cellular-resolution models to real-scale tumor simulations, which is critical for the development of digital twin models of diseases and requires High-Performance Computing (HPC), since every time step involves trillions of operations. We present a scalable HPC solution for molecular diffusion modeling using an efficient implementation of state-of-the-art Finite Volume Method (FVM) frameworks. The paper systematically evaluates a novel scalable Biological Finite Volume Method (BioFVM) library and presents an extensive performance analysis of the available solutions. Results show that our HPC proposal reaches almost a 200x speedup and up to a 36% reduction in memory usage over current state-of-the-art solutions, paving the way to efficiently computing the next generation of biological problems.


💡 Research Summary

This paper addresses a critical bottleneck in large‑scale, agent‑based multiscale cell simulations: the diffusion‑decay step that must be solved at every time step for millions to billions of voxels and multiple substrates. The authors start by reviewing the state of the art. PhysiCell uses the BioFVM library, which implements a Finite Volume Method (FVM) with a Locally One‑Dimensional (LOD) splitting of the three spatial dimensions. Each dimension is solved independently using the Tridiagonal Matrix Algorithm (TDMA, also known as the Thomas algorithm). BioFVM is parallelized with OpenMP for shared‑memory systems, while BioFVM‑X extends it to distributed memory by partitioning the domain along a single axis and using MPI for halo exchanges. However, both implementations suffer from poor memory layout (a vector‑of‑vectors structure) and limited scalability because TDMA is inherently serial for each tridiagonal system, and the communication pattern in BioFVM‑X becomes a bottleneck as the number of MPI processes grows.
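The Tridiagonal Matrix Algorithm mentioned above can be stated compactly. The following is an illustrative sketch (our own code, not BioFVM's actual implementation) of the Thomas algorithm solving one tridiagonal system in O(n), as performed independently along each spatial dimension in the LOD splitting:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the Thomas algorithm (TDMA): solves a single
// tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
// Names and signature are ours, for exposition only.
std::vector<double> thomas_solve(std::vector<double> a,  // sub-diagonal (a[0] unused)
                                 std::vector<double> b,  // main diagonal
                                 std::vector<double> c,  // super-diagonal (c[n-1] unused)
                                 std::vector<double> d)  // right-hand side
{
    const std::size_t n = b.size();
    // Forward sweep: eliminate the sub-diagonal, modifying b and d in place.
    for (std::size_t i = 1; i < n; ++i) {
        const double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    // Backward substitution.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

Note the loop-carried dependence in both sweeps: each step needs the previous one, which is why a single system is inherently serial and parallelism must come from solving many independent systems at once.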

The authors propose BioFVM‑B, a hybrid MPI‑OpenMP solution that introduces three major innovations:

  1. Contiguous 4‑D Data Layout – Instead of storing substrate concentrations as a vector of vectors (one vector per voxel), BioFVM‑B allocates a single four‑dimensional array that interleaves the x, y, z coordinates with the substrate index. This layout eliminates pointer indirection, improves cache line utilization, and enables straightforward SIMD vectorization across the y‑z plane and the substrate dimension.

  2. Block‑Based Domain Decomposition and Overlapped Communication – The simulation domain is still split along the x‑axis, but each MPI rank now processes the domain in blocks of contiguous x‑planes. Non‑blocking MPI (Isend/Irecv) is used so that while a block is being computed, the halo data for the next block can be transferred in parallel. The number of blocks is chosen automatically by a simple heuristic: blocks = k * number_of_nodes, where k is a tunable factor that adapts to problem size. This approach reduces the number of messages, hides latency, and balances the workload across ranks.

  3. Vectorized TDMA Solver – The forward and backward sweeps of TDMA are rewritten to exploit both thread‑level parallelism (OpenMP) and data‑level parallelism (SIMD). Because the new data layout stores all substrates for a given y‑z coordinate consecutively, the solver can process many independent tridiagonal systems simultaneously, achieving the highest vector length potential in the x‑direction. The algorithm remains mathematically identical to the original TDMA, preserving first‑order temporal accuracy.
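The contiguous layout in innovation 1 can be sketched as a single flat allocation with a computed index. The ordering below (x outermost, substrate innermost, so all substrates of a voxel are adjacent and stepping through the y-z plane walks memory contiguously) is our assumption for illustration; the exact BioFVM-B ordering may differ:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal sketch of a contiguous 4-D concentration array: one flat
// allocation instead of one heap-allocated vector per voxel. Struct
// and member names are hypothetical, not BioFVM-B's actual API.
struct Microenvironment4D {
    std::size_t nx, ny, nz, ns;  // voxels per axis, number of substrates
    std::vector<double> data;    // single contiguous allocation

    Microenvironment4D(std::size_t nx_, std::size_t ny_,
                       std::size_t nz_, std::size_t ns_)
        : nx(nx_), ny(ny_), nz(nz_), ns(ns_),
          data(nx_ * ny_ * nz_ * ns_, 0.0) {}

    // Flat index: consecutive substrates of the same voxel are adjacent,
    // eliminating the pointer indirection of a vector-of-vectors.
    double& at(std::size_t x, std::size_t y, std::size_t z, std::size_t s) {
        return data[((x * ny + y) * nz + z) * ns + s];
    }
};
```

With this layout the inner loops of the solver touch memory sequentially, which is what enables both cache-line reuse and SIMD vectorization.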
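Innovation 2's block scheme can be sketched as follows. This is a hedged illustration of the stated heuristic (blocks = k * number_of_nodes) plus an even split of one rank's x-planes into blocks; the function names, the choice of k, and the even split are our assumptions. In the real solver each block's halo would be exchanged with non-blocking MPI_Isend/MPI_Irecv while the previous block is computed:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Block-count heuristic as described in the summary: blocks scale with
// the node count via a tunable factor k. Names are illustrative.
std::size_t choose_blocks(std::size_t nodes, std::size_t k) {
    return k * nodes;
}

// Split a rank's [0, local_nx) x-planes into `blocks` near-equal
// half-open ranges. Each range can be computed while the halo data for
// the next range is in flight, hiding communication latency.
std::vector<std::pair<std::size_t, std::size_t>>
partition_planes(std::size_t local_nx, std::size_t blocks) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base = local_nx / blocks;
    const std::size_t rem = local_nx % blocks;
    std::size_t start = 0;
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t len = base + (b < rem ? 1 : 0);
        ranges.emplace_back(start, start + len);  // [start, start + len)
        start += len;
    }
    return ranges;
}
```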
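The batched solve of innovation 3 can be illustrated with a simplified variant: many independent tridiagonal systems of equal length sharing constant stencil coefficients (as in a uniform diffusion discretization), stored with the system index innermost. The recurrence along the sweep direction stays serial, but the inner loop over systems has no loop-carried dependence and is therefore vectorizable. The layout and signature below are our own sketch, not BioFVM-B's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Batched TDMA: nsys independent systems of length n with shared
// constant coefficients (a, b, c). Right-hand sides are stored as
// d[i * nsys + s] and are overwritten with the solutions.
void batched_tdma(double a, double b, double c,
                  std::vector<double>& d,
                  std::size_t n, std::size_t nsys)
{
    // Modified diagonal is identical for every system: compute it once.
    std::vector<double> bp(n);
    bp[0] = b;
    for (std::size_t i = 1; i < n; ++i)
        bp[i] = b - (a / bp[i - 1]) * c;

    // Forward sweep; the inner loop over systems is vectorizable.
    for (std::size_t i = 1; i < n; ++i) {
        const double w = a / bp[i - 1];
        for (std::size_t s = 0; s < nsys; ++s)
            d[i * nsys + s] -= w * d[(i - 1) * nsys + s];
    }

    // Backward substitution, again sweeping all systems per step.
    for (std::size_t s = 0; s < nsys; ++s)
        d[(n - 1) * nsys + s] /= bp[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)
        for (std::size_t s = 0; s < nsys; ++s)
            d[i * nsys + s] =
                (d[i * nsys + s] - c * d[(i + 1) * nsys + s]) / bp[i];
}
```

Mathematically this is the same elimination as the scalar Thomas algorithm applied to each system, so accuracy is unchanged; only the iteration order differs.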

Complexity analysis shows that the total floating‑point work remains O(S·N³) (S = number of substrates, N = voxels per side), but the constant factor is dramatically reduced by the combined effects of better memory bandwidth utilization, reduced communication overhead, and SIMD acceleration.

Performance evaluation was conducted on the MareNostrum 5 supercomputer (AMD EPYC 7763 CPUs, 64 cores per node). Test cases used a cubic domain of 1,000³ voxels with 10 substrates. Scaling experiments from 1 to 256 nodes (64 to 16,384 cores) demonstrated:

  • Up to 200× speed‑up compared with the original BioFVM‑X implementation.
  • An average 196× speed‑up over BioFVM‑X across all node counts.
  • 36 % reduction in memory footprint due to the contiguous data structure.

The authors also discuss integration with higher‑level frameworks such as PhysiBoSS (which adds stochastic Boolean networks) and PhysiMeSS (which models the extracellular matrix). BioFVM‑B's design is compatible with these extensions, enabling full‑scale organ‑level simulations that retain cellular resolution.

In conclusion, BioFVM‑B demonstrates that the diffusion‑decay bottleneck can be effectively eliminated by (i) redesigning data structures for cache‑friendly access, (ii) employing block‑wise overlapped MPI communication, and (iii) fully vectorizing the TDMA solver. These advances make it feasible to run human‑scale digital twin simulations in near‑real‑time, opening the door to personalized medicine applications that require organ‑level, cell‑by‑cell modeling. Future work will explore GPU offloading, adaptive load balancing, and incorporation of nonlinear reaction terms to further broaden the applicability of the solver.

