Strassen's Matrix Multiplication Algorithm for Matrices of Arbitrary Order
The well-known algorithm of Volker Strassen for matrix multiplication can only be used for $(m2^k \times m2^k)$ matrices. For arbitrary $(n \times n)$ matrices one has to add zero rows and columns to the given matrices before Strassen's algorithm can be applied. Strassen gave a strategy for choosing $m$ and $k$ for arbitrary $n$ so that $n \leq m2^k$. In this paper we study the number $d$ of additional zero rows and columns and its influence on the number of flops used by the algorithm in the worst case ($d=n/16$), the best case ($d=1$) and the average case ($d\approx n/48$). The aim of this work is to give a detailed analysis of the number of additional zero rows and columns and of the additional work caused by Strassen's bad parameters. Strassen used the parameters $m$ and $k$ to show that his matrix multiplication algorithm needs fewer than $4.7n^{\log_2 7}$ flops. We show in this paper that these parameters cause approximately 20 % additional work in the worst case, in comparison to the optimal strategy for the worst case. This is the main motivation for the search for better parameters.
💡 Research Summary
The paper revisits Volker Strassen’s celebrated matrix‑multiplication algorithm, which achieves a theoretical complexity of O(n^{log₂7}) ≈ O(n^{2.807}) by recursively computing only seven products of sub‑matrices instead of eight. While the asymptotic bound holds for square matrices whose dimensions are exactly of the form m·2^k, real‑world problems rarely present such perfectly sized inputs. Consequently, practitioners must pad the original n×n matrices with zero rows and columns so that the padded size N = m·2^k satisfies N ≥ n. Strassen himself proposed a concrete rule for choosing the integer parameters m and k that guarantees N ≥ n, and used this rule to show that his algorithm never exceeds 4.7·n^{log₂7} floating‑point operations (flops).
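The summary above does not reproduce Strassen's exact rule for picking m and k. As an illustration only, the following hypothetical helper brute-force enumerates all candidate pairs with m = ⌈n/2^k⌉, which makes the trade-off visible: larger k buys more levels of Strassen recursion but can force more zero padding d = N − n. This scan is not Strassen's published parameter choice.

```python
def candidate_parameters(n):
    """Enumerate (m, k, N, d) with m = ceil(n / 2^k), so N = m * 2^k >= n.

    Illustrative brute-force scan only -- NOT Strassen's published rule
    for choosing m and k.
    """
    candidates = []
    k = 0
    while (1 << k) <= n:
        m = -(-n // (1 << k))      # ceil(n / 2^k) using floor division
        padded = m << k            # N = m * 2^k
        candidates.append((m, k, padded, padded - n))
        k += 1
    return candidates

# For n = 1000: k = 0 needs no padding but allows no recursion,
# while larger k buys recursion depth at the price of extra zeros.
for m, k, N, d in candidate_parameters(1000):
    print(f"m={m:4d} k={k} N={N:4d} d={d}")
```

Picking the best pair then means weighing the cheaper 7^k recursion against the (N/n)^{log₂7} penalty of the padding, which is exactly the tension the paper analyzes.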
The authors of the present study focus on the hidden cost introduced by this padding. They define d = N – n as the number of added zero rows (and equally columns) and examine three scenarios: worst case (d = n/16), best case (d = 1), and average case (d ≈ n/48). By substituting N = n + d into the recurrence for Strassen’s algorithm, they obtain an actual flop count proportional to (n + d)^{log₂7}. In the worst case, (1 + 1/16)^{log₂7} ≈ 1.19, meaning that the algorithm performs roughly 20 % more arithmetic than the theoretical optimum for a given n. The best case incurs virtually no overhead, while the average case yields an overhead of about 6 %.
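The overhead factors quoted above follow directly from the (n + d)^{log₂7} scaling; a quick sanity check in a few lines of Python:

```python
from math import log

OMEGA = log(7, 2)  # log2(7) ≈ 2.807, the exponent of Strassen's algorithm

def overhead(d_over_n):
    """Relative flop overhead of padding an n x n problem by d zero
    rows/columns: ((n + d) / n)^{log2 7}."""
    return (1.0 + d_over_n) ** OMEGA

worst = overhead(1 / 16)   # d = n/16  -> ~1.19, i.e. roughly 20 % extra work
best  = overhead(0)        # d = 1 is negligible for large n -> factor ~1.0
avg   = overhead(1 / 48)   # d ~ n/48  -> ~1.06, i.e. about 6 % extra work
print(round(worst, 3), round(best, 3), round(avg, 3))
```

Note that the best case (d = 1) is modeled here as d/n → 0, which is accurate for large n.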
Beyond raw operation counts, the paper discusses secondary effects of padding. Each recursive level creates sub‑matrices that contain the padded zeros, inflating memory footprints and degrading cache locality. On modern architectures where memory bandwidth and cache behavior dominate performance, this extra data movement can translate into a disproportionate increase in wall‑clock time. The authors therefore argue that Strassen’s original parameter choice, while mathematically convenient, is sub‑optimal for practical implementations, especially when n is large and the worst‑case padding ratio (1/16) is realized.
To address this inefficiency, the authors propose two complementary avenues. First, they suggest a refined selection algorithm for m and k that minimizes d for any given n, bringing N as close as possible to n. This can be achieved by allowing a wider range of values for m and by choosing k based on the binary representation of n, thereby reducing the worst‑case d from n/16 to roughly n/32 or less. Second, they explore hybrid strategies that combine Strassen’s recursion with conventional O(n³) multiplication for small sub‑blocks, or that employ non‑uniform padding (e.g., padding only rows or only columns) to keep the total number of padded elements low while still enabling the recursive divide‑and‑conquer structure. Preliminary experiments reported in the paper indicate that such hybrids can shave 10 %–15 % off the runtime compared with a naïve Strassen implementation using Strassen’s original parameters.
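The hybrid idea can be sketched in pure Python: pad the inputs with zeros to the next power of two (one simple padding policy, not the paper's optimized parameter choice), recurse using Strassen's seven products, and fall back to classical multiplication below a cutoff size. The function names and the cutoff value here are illustrative, not from the paper.

```python
def naive_multiply(A, B):
    """Classical O(n^3) multiplication, used below the cutoff."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def _add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def _sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def _strassen(A, B, cutoff):
    n = len(A)
    if n <= cutoff:
        return naive_multiply(A, B)
    h = n // 2
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # Strassen's seven recursive products.
    M1 = _strassen(_add(A11, A22), _add(B11, B22), cutoff)
    M2 = _strassen(_add(A21, A22), B11, cutoff)
    M3 = _strassen(A11, _sub(B12, B22), cutoff)
    M4 = _strassen(A22, _sub(B21, B11), cutoff)
    M5 = _strassen(_add(A11, A12), B22, cutoff)
    M6 = _strassen(_sub(A21, A11), _add(B11, B12), cutoff)
    M7 = _strassen(_sub(A12, A22), _add(B21, B22), cutoff)
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_sub(_add(M1, M3), M2), M6)
    return ([r11 + r12 for r11, r12 in zip(C11, C12)] +
            [r21 + r22 for r21, r22 in zip(C21, C22)])

def strassen_multiply(A, B, cutoff=2):
    """Hybrid Strassen: zero-pad n x n inputs to the next power of two,
    recurse, then crop the padded rows/columns from the result."""
    n = len(A)
    N = 1
    while N < n:
        N *= 2
    Ap = [[A[i][j] if i < n and j < n else 0 for j in range(N)] for i in range(N)]
    Bp = [[B[i][j] if i < n and j < n else 0 for j in range(N)] for i in range(N)]
    C = _strassen(Ap, Bp, cutoff)
    return [row[:n] for row in C[:n]]
```

Raising `cutoff` trades recursion levels for classical multiplications; tuning that crossover point is exactly the kind of hybrid strategy the summary describes.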
The paper concludes by outlining future research directions. One line of work involves developing meta‑heuristic or machine‑learning based autotuners that, given a target hardware platform (multicore CPU, GPU, or FPGA), automatically select the optimal (m, k) pair and the crossover point between Strassen and classical multiplication. Another direction is to construct a comprehensive cost model that incorporates both arithmetic operations and data‑movement overhead, enabling more accurate predictions of real‑world performance. Finally, the authors suggest investigating the integration of Strassen’s method with newer asymptotically faster algorithms such as the Coppersmith‑Winograd family or Le Gall’s recent improvements, potentially yielding hybrid algorithms that combine the best of theoretical speed‑up and practical efficiency. In sum, the study quantifies the hidden penalty of Strassen’s “bad parameters,” demonstrates that it can be as high as 20 % in the worst case, and motivates a systematic search for better padding strategies and hybrid algorithms to close the gap between theoretical complexity and observed performance.