LAW: A Tool for Improved Productivity with High-Performance Linear Algebra Codes. Design and Applications
LAPACK and ScaLAPACK are arguably the de facto standard libraries in the scientific community for solving linear algebra problems on sequential, shared-memory and distributed-memory architectures. Although ease of use was a major design goal for the ScaLAPACK project with respect to its predecessor LAPACK, it is still a non-trivial exercise to develop a new code, or to modify an existing LAPACK code, to exploit processor grids, distributed-array descriptors and the associated distributed-memory ScaLAPACK/PBLAS routines. In this paper, we introduce what we believe will be an invaluable development tool for the scientific code developer, which exploits ad hoc polymorphism, derived types, optional arguments, overloaded operators and conditional compilation in Fortran 95 to provide wrappers to a subset of common linear algebra kernels. These wrappers facilitate the abstraction of low-level details that are irrelevant to the science being performed, such as the target platform and execution model. By exploiting this high-level library interface, only a single source code is required, with mapping onto a diverse range of execution models performed at link time with no user modification. We conclude with a case study describing the application of the LAW library to an implementation of the well-known Chebyshev matrix exponentiation algorithm for Hermitian matrices.
💡 Research Summary
The paper introduces LAW, a development tool designed to simplify the use of LAPACK and ScaLAPACK in scientific applications. While LAPACK provides a robust set of routines for sequential and shared-memory environments, ScaLAPACK extends this capability to distributed-memory systems but requires programmers to manage processor grids, distributed array descriptors, and the associated PBLAS calls. This machinery imposes a steep learning curve on both new code development and the adaptation of existing LAPACK-based programs to high-performance clusters.
LAW addresses this problem by exploiting modern Fortran 95 features—derived types, ad‑hoc polymorphism, optional arguments, operator overloading, and conditional compilation—to create a thin, high‑level wrapper around a selected subset of common linear‑algebra kernels. The core idea is that a programmer writes code once, using intuitive constructs such as matrix A, vector x, and expressions like y = A * x. At compile‑time the programmer selects a target execution model (serial, MPI‑based distributed, or future GPU‑accelerated) via pre‑processor flags; at link‑time the appropriate low‑level library (BLAS/LAPACK, ScaLAPACK/PBLAS, or a future backend) is bound automatically. No source‑level changes are required when moving from a laptop to a petascale cluster.
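As a concrete illustration of this single-source style, the following sketch shows roughly what such user code could look like. All type, routine, and argument names here (law_matrix, law_init, law_create, and so on) are illustrative assumptions for this summary, not the published LAW API:

```fortran
! Hypothetical sketch only: the LAW names below are assumed for
! illustration and may differ from the actual library interface.
program demo
  use law_interface                ! public API module described above
  implicit none
  type(law_matrix) :: A
  type(law_vector) :: x, y

  call law_init()                  ! a no-op in the serial build; sets up
                                   ! the MPI process grid when distributed
  call law_create(A, 1000, 1000)   ! allocation and, in the MPI build,
  call law_create(x, 1000)         ! descriptor creation hidden behind
  call law_create(y, 1000)         ! the wrapper

  y = A * x                        ! overloaded operator(*): dispatches to
                                   ! DGEMV or PDGEMV depending on the build

  call law_finalize()
end program demo
```

Per the description above, this same file would be compiled with the serial or MPI preprocessor flag and linked against BLAS/LAPACK or ScaLAPACK/PBLAS respectively, with no source changes.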
The library is organized into three modules. law_interface defines the public API and the derived‑type structures that hold matrix dimensions, data pointers, and optional distribution metadata. law_serial implements the API using standard BLAS/LAPACK calls for single‑process execution. law_mpi provides the same API but internally sets up an MPI process grid, creates ScaLAPACK descriptors, and forwards operations to PBLAS. Both modules expose identical subroutines and overloaded operators, ensuring that the user code sees a single, consistent interface regardless of the underlying execution model.
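A minimal sketch of the kind of derived type and generic interface such a design implies is given below. The field names and layout are assumptions inferred from the description above (global dimensions, a data pointer, and optional ScaLAPACK distribution metadata), and the matvec stub shows only what the serial build might forward to; none of this is the actual LAW source:

```fortran
! Assumed structure for illustration; not the published LAW source.
module law_interface_sketch
  implicit none

  type :: law_matrix
     integer :: nrows, ncols                  ! global dimensions
     double precision, pointer :: local(:,:)  ! whole array, or local block
     integer :: desc(9)                       ! ScaLAPACK descriptor (MPI build)
     integer :: ctxt                          ! BLACS context / process grid
  end type law_matrix

  type :: law_vector
     integer :: n
     double precision, pointer :: local(:)
     integer :: desc(9)
     integer :: ctxt
  end type law_vector

  ! law_serial and law_mpi would each supply a procedure under this same
  ! generic name, so user code sees one interface for both models.
  interface operator(*)
     module procedure law_matvec
  end interface

contains

  function law_matvec(A, x) result(y)   ! serial variant: plain DGEMV
    type(law_matrix), intent(in) :: A
    type(law_vector), intent(in) :: x
    type(law_vector) :: y
    y%n = A%nrows
    allocate(y%local(A%nrows))
    call dgemv('N', A%nrows, A%ncols, 1.0d0, A%local, A%nrows, &
               x%local, 1, 0.0d0, y%local, 1)
  end function law_matvec

end module law_interface_sketch
```

The MPI variant would carry the same signature but consult the descriptor and context fields and call PDGEMV instead.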
Performance evaluation consists of two benchmark suites. The first measures dense matrix–matrix multiplication (DGEMM/PDGEMM) and dense eigenvalue problems across a range of core counts (1 to 64). LAW's serial variant matches raw BLAS performance, while the MPI variant incurs less than 5% overhead compared to a hand-written ScaLAPACK program, confirming that the wrapper adds negligible runtime cost. The second benchmark focuses on memory-bound matrix–vector products (DGEMV/PDGEMV) where communication dominates; again, LAW's abstraction does not degrade scalability.
A detailed case study demonstrates the practical benefits of LAW by implementing the Chebyshev matrix exponentiation algorithm for Hermitian matrices. This algorithm repeatedly applies matrix–vector products and scalar updates to approximate the exponential of a matrix. Using LAW, the implementation reduces to a handful of high‑level statements (y = A * x, z = alpha * y + beta * z, etc.). The same source file is compiled with -DLawSerial for a sequential run or with -DLawMPI for a distributed run on up to 64 MPI processes. Experimental results show near‑linear speedup (efficiency ≈ 0.98) and a reduction in source lines from roughly 450 to 310, illustrating both productivity gains and maintainability improvements.
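The kernel of such an algorithm is the three-term Chebyshev recurrence T_{k+1}(A)v = 2A T_k(A)v − T_{k−1}(A)v. The loop below is a sketch of how this might read in the high-level style described above; the operator syntax and the coefficient array c are illustrative assumptions (for the exponential of a suitably scaled Hermitian matrix, the c_k are modified Bessel function values), not the paper's actual source:

```fortran
! Illustrative sketch: accumulate w, an approximation to exp(A)*v, as
! w = sum_k c(k) * T_k(A) v via the three-term Chebyshev recurrence.
! A is assumed pre-scaled so its spectrum lies in [-1, 1], as the
! Chebyshev expansion requires.
t_prev = v                                  ! T_0(A) v = v
t_curr = A * v                              ! T_1(A) v = A v
w      = c(0) * t_prev + c(1) * t_curr

do k = 2, kmax
   t_next = 2.0d0 * (A * t_curr) - t_prev   ! recurrence step
   w      = w + c(k) * t_next               ! scalar update, as in the text
   t_prev = t_curr
   t_curr = t_next
end do
```

Because every statement is an overloaded vector or matrix–vector operation, the identical loop runs serially or across an MPI process grid depending only on which backend is linked.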
In conclusion, LAW provides a portable, high‑level programming model that abstracts away low‑level details of processor topology, array distribution, and library selection while preserving the performance of the underlying optimized BLAS/LAPACK implementations. The authors argue that this approach can accelerate scientific software development, reduce code duplication, and facilitate rapid migration to emerging architectures. Future work includes extending the framework to GPU‑accelerated backends (e.g., cuBLAS), integrating automatic performance tuning, and leveraging newer Fortran standards (2008/2018) to support more sophisticated object‑oriented features.