Graph Expansion and Communication Costs of Fast Matrix Multiplication
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $\Omega((\frac{n}{\sqrt M})^{\omega_0}\cdot M)$, where $\omega_0$ is the exponent in the arithmetic count (e.g., $\omega_0 = \lg 7$ for Strassen, and $\omega_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms, and hence optimal. These bounds can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).
💡 Research Summary
The paper “Graph Expansion and Communication Costs of Fast Matrix Multiplication” establishes a deep connection between the expansion properties of a computation’s directed acyclic graph (CDAG) and the I/O (communication) complexity of fast matrix multiplication algorithms such as Strassen’s method and its variants.
The authors first formalize a two‑level memory model for sequential computation (a fast memory of size M and an unbounded slow memory) and a parallel model with p processors, each equipped with a local memory of size M. Communication cost is measured both in terms of bandwidth (total words moved) and latency (total number of messages), using the standard α + βn cost model, where α is the per‑message latency and β the per‑word transfer time.
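As a small illustration of the α + βn model (the function name and sample numbers are ours, not the paper’s), the cost of a communication schedule decomposes into a latency term and a bandwidth term:

```python
def comm_cost(message_sizes, alpha, beta):
    """Total cost of a communication schedule under the alpha-beta model:
    each message of w words costs alpha (per-message latency) plus
    beta * w (per-word bandwidth). Illustrative sketch only."""
    latency = alpha * len(message_sizes)    # one alpha per message sent
    bandwidth = beta * sum(message_sizes)   # beta per word moved
    return latency + bandwidth
```

Batching the same words into fewer messages reduces the latency term while leaving the bandwidth term unchanged, which is why the paper tracks the two costs separately.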
A CDAG represents every input, intermediate, and output value as a vertex, with directed edges encoding data dependencies. In a parallel setting, partitioning the CDAG among processors creates inter‑processor edges that correspond exactly to required communication. The I/O complexity of an algorithm is defined as the minimum bandwidth cost over all possible implementations (orderings or partitions) that respect the CDAG’s partial order.
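To make the CDAG formalism concrete, here is a minimal toy encoding (our own, not the paper’s): each value maps to the set of values it consumes, and an implementation is any execution order that respects this partial order.

```python
def respects_partial_order(cdag, order):
    """Return True if `order` is a valid schedule for the CDAG:
    every value is computed only after all values it depends on.
    `cdag` maps each vertex to the set of its predecessors."""
    seen = set()
    for v in order:
        if not cdag.get(v, set()) <= seen:
            return False  # v scheduled before one of its inputs
        seen.add(v)
    return True

# Toy CDAG for the computation t = a * b; out = t + c
deps = {"a": set(), "b": set(), "c": set(),
        "t": {"a", "b"}, "out": {"t", "c"}}
```

The I/O complexity then minimizes the bandwidth cost over all such valid orders (sequential case) or partitions (parallel case).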
Previous work had already derived tight I/O lower bounds for the classical Θ(n³) matrix multiplication (Hong–Kung 1981; Irony–Toledo–Tiskin 2004), but fast recursive algorithms (Strassen 1969, Winograd’s variant, Coppersmith–Winograd, etc.) resisted those techniques because their computation graphs are not amenable to Loomis–Whitney geometric arguments.
The core contribution of this paper is a novel expansion‑based method. The authors analyze the edge expansion of the CDAG: for any subset S of vertices, the number of edges leaving S is at least proportional to |S| raised to a power determined by the algorithm’s arithmetic exponent ω₀ (e.g., ω₀ = log₂ 7 ≈ 2.807 for Strassen). A set of values that resides entirely in fast memory can have size at most s ≤ M; any larger set of computed values forces at least Ω(s·(n/√M)^{ω₀−2}) words to cross the memory boundary. Summing this requirement over the recursive levels of a Strassen‑like algorithm yields a lower bound on the total number of words moved:
Ω((n/√M)^{ω₀}·M) for the sequential case,
and, after a standard reduction from sequential to parallel models,
Ω((n/√M)^{ω₀}·M/p) for p processors.
These bounds hold for any implementation that does not recompute intermediate values, a condition satisfied by all known practical variants of Strassen’s algorithm (including Winograd’s 15‑addition version).
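For concreteness, instantiating the sequential bound with Strassen’s exponent ω₀ = lg 7 (the rearrangement below is ours, obtained by expanding the powers of M) gives

$$
\Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\lg 7} \cdot M\right)
= \Omega\!\left(\frac{n^{\lg 7}}{M^{\lg 7/2 - 1}}\right)
\approx \Omega\!\left(\frac{n^{2.81}}{M^{0.40}}\right),
$$

while setting ω₀ = 3 recovers the classical Ω(n³/√M) bound of Hong and Kung.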
To show that the bounds are tight, the authors present a concrete implementation strategy: recursively apply the fast algorithm until sub‑problems reach size Θ(√M), then perform each remaining multiplication entirely inside fast memory at O(M) I/O cost. This yields an upper bound
O((n/√M)^{ω₀}·M),
which matches the lower bound up to constant factors, proving optimality. The same reasoning extends to any “Strassen‑like” algorithm—defined as a recursive scheme that multiplies two constant‑size matrices to build larger products and runs in O(n^{ω₀}) arithmetic operations for 2 ≤ ω₀ < 3. Consequently, the paper provides the first known I/O lower bounds for all such fast matrix multiplication families.
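The recurrence behind this upper bound can be checked numerically. The sketch below is our own model, not the paper’s implementation: it uses Strassen’s counts (7 half‑size multiplications and 18 half‑size additions per level) and assumes a base case that simply loads two √M‑sized operand blocks and stores their product.

```python
def strassen_io(n, M):
    """Words moved by the recursive scheme described above: recurse until
    a sub-problem's three blocks fit in fast memory (3*n*n <= M), paying
    O(n^2) I/O per level for the matrix additions, then O(M) I/O for each
    base-case product done entirely in fast memory. Illustrative model;
    the constants are not the paper's."""
    if 3 * n * n <= M:
        return 3 * n * n  # read two operand blocks, write the product
    # 7 recursive half-size products + 18 half-size additions/subtractions
    return 7 * strassen_io(n // 2, M) + 18 * (n // 2) ** 2
```

Once n ≫ √M, doubling n multiplies the count by roughly 7, i.e., the count grows like (n/√M)^{lg 7}·M, matching the lower bound up to constant factors.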
Beyond matrix multiplication, the authors note that many dense linear‑algebra kernels (LU, QR, Sylvester equation solvers, etc.) are built from a constant number of matrix products. Because the communication lower bound for the underlying multiplication propagates through these higher‑level algorithms, the same Ω((n/√M)^{ω₀}·M) bound applies, and optimal implementations can be obtained by embedding the optimal multiplication kernels within the larger algorithmic framework.
The paper also discusses practical considerations: the optimal recursion depth depends on the fast memory size M, and the theoretical model assumes no contention and a single message at a time per processor (both assumptions can be relaxed with at most constant‑factor penalties). The authors acknowledge that while the expansion‑based technique yields clean bounds for fast algorithms, it is fundamentally different from the Loomis‑Whitney approach used for classical multiplication, and each method appears specialized to its respective class of algorithms.
In summary, Ballard, Demmel, Holtz, and Schwartz have introduced a powerful graph‑theoretic tool to analyze communication costs of fast matrix multiplication. They prove that for any Strassen‑like algorithm with arithmetic exponent ω₀, the I/O complexity on a machine with fast memory M (or on p processors each with memory M) is Θ((n/√M)^{ω₀}·M) (divided by p in the parallel case). The bounds are both provably optimal and achievable, and they extend naturally to a broad class of linear‑algebra operations, offering a unified framework for designing communication‑optimal high‑performance algorithms.