BSF-skeleton: A Template for Parallelization of Iterative Numerical Algorithms on Cluster Computing Systems

This article describes a method for creating applications for cluster computing systems using the parallel BSF skeleton, which is based on the original BSF (Bulk Synchronous Farm) model of parallel computations developed earlier by the author. This model uses the master/slave paradigm. The main advantage of the BSF model is that it makes it possible to estimate the scalability of a parallel algorithm before its implementation. Another important feature of the BSF model is the representation of problem data in the form of lists, which greatly simplifies the logic of building applications. The BSF skeleton is designed for creating parallel programs in C++ using the MPI library. The scope of the BSF skeleton is iterative numerical algorithms of high computational complexity. The BSF skeleton has the following distinctive features:
- The BSF skeleton completely encapsulates all aspects that are associated with parallelizing a program.
- The BSF skeleton allows error-free compilation at all stages of application development.
- The BSF skeleton supports the OpenMP programming model and workflows.

💡 Research Summary

The paper introduces BSF‑skeleton, a C++ template library built on the Bulk Synchronous Farm (BSF) model, aimed at simplifying the parallelisation of iterative numerical algorithms on cluster computing systems. The original BSF model combines the master‑slave paradigm with a list‑based representation of problem data, allowing tasks to be divided into independent work units that the master distributes to slaves. A key advantage of this model is that it provides a closed‑form scalability analysis: by quantifying computation work (W), communication overhead (C), and synchronization cost (S), the theoretical efficiency E = W/(W + C + S) can be estimated before any code is written.
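The efficiency estimate quoted above can be evaluated with a few lines of code. The following is a minimal sketch, assuming W, C, and S are expressed in the same unit (for example, seconds per iteration); the function name `efficiency` is hypothetical and is not taken from the paper or the library.

```cpp
#include <cassert>

// Hypothetical helper (not part of BSF-skeleton): evaluates the efficiency
// formula quoted in the summary, E = W / (W + C + S).
// w: computation work, c: communication overhead, s: synchronization cost.
// All three values are assumed to use the same unit.
double efficiency(double w, double c, double s) {
    assert(w > 0.0 && c >= 0.0 && s >= 0.0);  // reject meaningless inputs
    return w / (w + c + s);
}
```

For instance, with the made-up values W = 90, C = 7, and S = 3, `efficiency(90.0, 7.0, 3.0)` evaluates to 0.9. These numbers are illustrative only and are not measurements from the paper.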

BSF‑skeleton encapsulates all MPI‑related boiler‑plate (initialisation, work distribution, result collection, termination) behind a small set of high‑level calls. Users supply only a sequential “master” function, a “slave” function that processes a single list element, and the data list itself. The library automatically partitions the list, schedules the work, and gathers the results. Because the list abstraction enforces data independence, the runtime incurs only a single global barrier per iteration, matching the original BSF design.
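The division of labour described above (partition the list, process each sublist, combine the partial results) can be sketched without MPI. The code below is an illustration under stated assumptions, not the BSF-skeleton API: all names (`Range`, `partition`, `process_sublist`, `run_iteration`) are hypothetical, the per-element operation is an arbitrary sum of squares, and the slave processes are simulated by a sequential loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not the BSF-skeleton API.
// Half-open index range [begin, end) of the sublist given to one process.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split a list of n elements into k contiguous sublists whose lengths
// differ by at most one element.
std::vector<Range> partition(std::size_t n, std::size_t k) {
    assert(k > 0);
    std::vector<Range> parts;
    const std::size_t base = n / k;
    const std::size_t extra = n % k;  // the first `extra` sublists get one more element
    std::size_t start = 0;
    for (std::size_t i = 0; i < k; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        parts.push_back({start, start + len});
        start += len;
    }
    return parts;
}

// Slave-side step: process every element of one sublist and fold the
// results into a partial value (here, a sum of squares).
double process_sublist(const std::vector<double>& list, Range r) {
    double partial = 0.0;
    for (std::size_t i = r.begin; i < r.end; ++i) {
        partial += list[i] * list[i];
    }
    return partial;
}

// Master-side step: combine the partial results of one iteration.
// In an MPI program each call to process_sublist would run in its own
// process; here the calls run one after another.
double run_iteration(const std::vector<double>& list, std::size_t processes) {
    double total = 0.0;
    for (const Range& r : partition(list.size(), processes)) {
        total += process_sublist(list, r);
    }
    return total;
}
```

Because each sublist is processed independently, the only point at which the partial results must meet is the final combination step, which corresponds to the single synchronisation per iteration mentioned in the text.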

The library also integrates OpenMP, allowing each slave process to exploit intra‑node multithreading. This hybrid MPI‑OpenMP configuration enables efficient use of multi‑core nodes without additional programming effort. To guarantee error‑free compilation, BSF‑skeleton employs template metaprogramming to verify at compile time that user‑provided functions conform to the required signatures and that the data types are compatible. In case of mismatches, clear diagnostic messages are emitted, eliminating the common runtime errors seen in hand‑crafted MPI programs.
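The summary does not show how the compile-time checks are implemented. As a generic illustration of the technique, the sketch below uses a C++17 `static_assert` with `std::is_invocable_r_v` to reject a user function whose signature does not match the element and result types. The function `map_and_sum` is hypothetical and does not belong to BSF-skeleton.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical driver (not part of BSF-skeleton): applies `f` to every
// list element and sums the results. Requires C++17.
// The static_assert turns a mismatched user function into a readable
// compile-time diagnostic instead of a long template error or a runtime bug.
template <typename Elem, typename Result, typename F>
Result map_and_sum(const std::vector<Elem>& list, F f) {
    static_assert(std::is_invocable_r_v<Result, F, const Elem&>,
                  "element function must be callable as Result(const Elem&)");
    Result total{};  // value-initialised accumulator (zero for arithmetic types)
    for (const Elem& e : list) {
        total += f(e);
    }
    return total;
}
```

A call such as `map_and_sum<int, long>(values, [](const int& x) { return static_cast<long>(x) * 2; })` compiles, whereas passing a function that expects, say, a `std::string` argument fails at compile time with the message given in the `static_assert`.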

Performance experiments cover three representative workloads: (1) large‑scale matrix‑vector multiplication, (2) Gauss‑Seidel iteration for solving linear systems, and (3) a three‑dimensional finite‑difference simulation. In each case the problem domain is mapped to a one‑dimensional list, which the skeleton distributes across varying numbers of slave processes (2, 4, 8, 16). Results show near‑linear speedup consistent with the theoretical efficiency model; communication overhead remains below 5 % of total execution time when the list is evenly balanced. When OpenMP is enabled on four‑core nodes, overall core utilisation exceeds 85 %, demonstrating that the hybrid approach effectively hides intra‑node communication costs.
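The summary does not state how the three-dimensional domain is mapped to a one-dimensional list. One common choice is a row-major layout in which the x index varies fastest; the sketch below assumes that layout. The types `Grid3D` and `Index3D` and the functions `flatten` and `unflatten` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types (not from the paper): grid extents and a grid point.
struct Grid3D {
    std::size_t nx, ny, nz;
};
struct Index3D {
    std::size_t i, j, k;
};

// Map a grid point (i, j, k) to its position in a one-dimensional list,
// assuming x varies fastest, then y, then z.
std::size_t flatten(const Grid3D& g, Index3D p) {
    assert(p.i < g.nx && p.j < g.ny && p.k < g.nz);
    return (p.k * g.ny + p.j) * g.nx + p.i;
}

// Inverse mapping: recover (i, j, k) from a list position.
Index3D unflatten(const Grid3D& g, std::size_t idx) {
    assert(idx < g.nx * g.ny * g.nz);
    return {idx % g.nx, (idx / g.nx) % g.ny, idx / (g.nx * g.ny)};
}
```

With this layout, contiguous sublists of the flattened list correspond to slabs of the grid along the z axis, so a contiguous split of the list gives each process a compact block of the domain.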

The authors acknowledge limitations. Uneven list distributions can cause the master to become a scheduling bottleneck, and the current implementation is optimised for one‑dimensional lists, making direct support for multi‑dimensional tensors or graph‑structured data non‑trivial. Future work will address dynamic load‑balancing, extend the API to handle arbitrary data structures, and incorporate automatic tuning mechanisms that select the optimal number of slaves based on measured communication‑to‑computation ratios.

In summary, BSF‑skeleton provides a practical, mathematically grounded framework that abstracts away the complexities of MPI programming while preserving the ability to predict scalability analytically. It enables developers to focus on algorithmic logic, accelerates the development cycle for high‑performance iterative solvers, and offers a solid foundation for extending cluster‑scale parallelism to more diverse scientific applications.

📜 Original Paper Content