Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen’s fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
💡 Research Summary
The paper investigates the fundamental limits of strong scaling for distributed-memory matrix multiplication algorithms, covering both the classical O(n³) algorithm and Strassen's fast O(n^{log₂7}) algorithm. Perfect strong scaling means that, for a fixed problem size, the total runtime on P processors is proportional to 1/P, with both the computation and the communication costs shrinking linearly. Prior work (Solomonik and Demmel 2011 for the classical algorithm; Ballard, Demmel, Holtz, Lipshitz, and Schwartz 2012 for a Strassen-based algorithm) exhibited algorithms with perfect strong scaling, but only up to a certain processor count; beyond that point, communication costs cease to shrink linearly with P.
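In symbols (a sketch using the standard α-β-γ communication cost model; this notation is assumed here, not quoted from the paper): with F(P) flops, W(P) words, and S(P) messages along the critical path, the runtime is modeled as

```latex
T(P) \;=\; \gamma\, F(P) \;+\; \beta\, W(P) \;+\; \alpha\, S(P)
```

where γ, β, and α are the per-flop, per-word, and per-message costs. Perfect strong scaling requires every term, not just the flop count, to decrease linearly in P, i.e. T(P) = Θ(T(P₀)·P₀/P) over the scaling range.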
The central contribution of the paper is a memory-independent communication lower bound. Traditional communication lower bounds depend on the local memory size M of each processor (e.g., Ω(n³/(P√M)) words for classical multiplication, which reduces to Ω(n²/√P) when M = Θ(n²/P)). The authors derive bounds that are independent of M, relying solely on the problem size n and the number of processors P: when multiplying two n×n matrices, some processor must communicate at least Ω(n²/P^{2/3}) words under any classical algorithm, and at least Ω(n²/P^{2/ω₀}) words under any Strassen-based algorithm, where ω₀ = log₂7 ≈ 2.81. These bounds are proved with combinatorial arguments on the computation graph: a Loomis-Whitney-style volume inequality in the classical case, and an expansion analysis of Strassen's computation DAG in the fast case.
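Up to constant factors, the two memory-independent bounds can be evaluated directly; a minimal sketch in Python (the function names and sample sizes are illustrative, not from the paper):

```python
import math

OMEGA0 = math.log2(7)  # Strassen's exponent, about 2.81

def classical_bound(n, P):
    """Memory-independent bandwidth lower bound for classical
    matrix multiplication: Omega(n^2 / P^(2/3)) words, constants omitted."""
    return n**2 / P ** (2 / 3)

def strassen_bound(n, P):
    """Memory-independent bandwidth lower bound for Strassen-based
    algorithms: Omega(n^2 / P^(2/omega0)) words, constants omitted."""
    return n**2 / P ** (2 / OMEGA0)

# Example: n = 4096, P = 64. Since 2/omega0 > 2/3, the Strassen-based
# bound is the smaller of the two, consistent with Strassen moving
# asymptotically less data than the classical algorithm.
print(classical_bound(4096, 64))  # 4096^2 / 64^(2/3) = 1048576.0
print(strassen_bound(4096, 64))
```

Note that the Strassen-based bound decays *faster* in P than the classical one, yet (as the next paragraph explains) its perfect strong-scaling range ends at a smaller processor count.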
By comparing these universal lower bounds with the communication costs of the two known strongly-scaling algorithms, the authors show that the existing algorithms are essentially optimal with respect to strong scaling. The Solomonik-Demmel algorithm achieves perfect strong scaling as long as P = O(n³/M^{3/2}), while the Strassen-based algorithm of Ballard et al. does so for P = O(n^{log₂7}/M^{log₂7/2}). These thresholds are exactly where the memory-dependent bounds meet the memory-independent floors; once P exceeds them, the memory-independent lower bound forces the communication cost to stop decreasing proportionally to 1/P, and the overall runtime deviates from ideal 1/P scaling. Consequently, no classical or Strassen-based parallel matrix multiplication algorithm can exhibit perfect strong scaling beyond the processor ranges already attained by these two methods.
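The classical threshold follows from equating the two bounds: n³/(P√M) = n²/P^{2/3} gives P^{1/3} = n/√M, hence P = n³/M^{3/2}. A small Python sketch of both thresholds (the sample values of n and M are hypothetical):

```python
import math

def classical_threshold(n, M):
    """Largest P (up to constants) at which classical matmul can still
    scale perfectly: P = n^3 / M^(3/2), obtained by equating the
    memory-dependent bound n^3/(P*sqrt(M)) with the memory-independent
    floor n^2/P^(2/3)."""
    return n**3 / M**1.5

def strassen_threshold(n, M):
    """Analogous threshold for Strassen-based algorithms:
    P = n^omega0 / M^(omega0/2), with omega0 = log2(7)."""
    w = math.log2(7)
    return n**w / M ** (w / 2)

# Example: n = 8192 and M = 2^20 words of local memory per processor.
print(classical_threshold(8192, 2**20))  # 2^39 / 2^30 = 512.0
print(strassen_threshold(8192, 2**20))   # roughly 343: a shorter range
```

For these (hypothetical) sizes the Strassen-based threshold is the smaller of the two, illustrating the paper's point that the faster algorithm stops scaling perfectly at a lower processor count.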
The paper further argues that the memory-independent bounds are not limited to square matrix multiplication. Similar arguments extend to other algorithms with comparable dependency structure, including other Strassen-like fast matrix multiplication schemes. In each case, the bound ties the minimal communication cost to the arithmetic structure of the algorithm's computation graph, independent of how much local memory each processor possesses.
From a practical perspective, these results give algorithm designers a clear boundary: beyond the threshold processor counts, the memory-independent bound applies no matter how much memory each processor has, so adding memory cannot restore perfect strong scaling. Further gains must instead come from growing the problem size with P (weak scaling) or from algorithms with fundamentally different computation graphs, such as asymptotically faster matrix multiplication; engineering techniques like overlapping communication with computation can hide at most a constant factor of the communication cost.
In summary, the paper establishes that the strong‑scaling limits observed for the best known classical and Strassen‑based matrix multiplication algorithms are not artifacts of current implementations but are dictated by intrinsic, memory‑independent communication lower bounds. This insight both validates the optimality of the existing algorithms and delineates the theoretical frontier for future research in high‑performance, distributed‑memory linear algebra.