On Parallelizing Matrix Multiplication by the Column-Row Method
We consider the problem of sparse matrix multiplication by the column-row method in a distributed setting where the matrix product is not necessarily sparse. We present a surprisingly simple method for “consistent” parallel processing of sparse outer products (column-row vector products) over several processors, in a communication-avoiding setting where each processor has a copy of the input. The method is consistent in the sense that a given output entry is always assigned to the same processor independently of the specific structure of the outer product. We show guarantees on the work done by each processor, and achieve linear speedup down to the point where the cost is dominated by reading the input. Our method gives a way of distributing (or parallelizing) matrix product computations in settings where the main bottlenecks are storing the result matrix, and inter-processor communication. Motivated by observations on real data that often the absolute values of the entries in the product adhere to a power law, we combine our approach with frequent-items mining algorithms and show how to obtain a tight approximation of the weight of the heaviest entries in the product matrix. As a case study we present the application of our approach to frequent pair mining in transactional data streams, a problem that can be phrased in terms of sparse $\{0,1\}$-integer matrix multiplication by the column-row method. Experimental evaluation of the proposed method on real-life data supports the theoretical findings.
💡 Research Summary
The paper tackles the problem of multiplying two sparse matrices using the column‑row (outer‑product) method in a distributed environment where the product matrix may be dense. Traditional parallel matrix multiplication schemes often rely on row‑column decomposition or block partitioning, which can cause heavy communication overhead and load imbalance when the input matrices are sparse but the output is not. The authors propose a remarkably simple, communication‑avoiding scheme that assigns each output entry (i, j) to a fixed processor based on a deterministic hash function h(i, j) mod p, where p is the number of processors. Because the assignment depends only on the coordinates of the output entry and not on the particular outer product being processed, the method is “consistent”: the same entry is always handled by the same processor regardless of the order or distribution of the outer products.
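The consistent assignment rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: `hash((i, j)) % p` stands in for the paper's hash function h(i, j) mod p.

```python
def assign(i: int, j: int, p: int) -> int:
    """Map output entry (i, j) to a processor id in [0, p).

    Illustrative stand-in for the paper's hash h(i, j) mod p: the
    assignment depends only on the coordinates (i, j), never on which
    outer product produced the entry, so it is "consistent".
    """
    return hash((i, j)) % p

# Consistency: the same entry always lands on the same processor,
# no matter when or where it is evaluated.
assert assign(3, 7, 8) == assign(3, 7, 8)
```

Because the rule is deterministic, no coordination is needed to decide ownership of an entry: every processor can evaluate it locally.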
Each processor holds a full copy of the input matrices, eliminating the need for data exchange during computation. When processing an outer product (column k of A times row k of B), a processor updates only those entries (i, j) for which h(i, j) points to it. Consequently, the only communication required is the initial broadcast of the inputs (which can be done once) and the final write‑out of the result matrix. The authors prove that the work per processor is bounded by
O((nnz(A)·nnz(B))/p + nnz(C)),
where nnz(·) denotes the number of non‑zero elements. Even when C is dense, the per‑processor memory footprint remains proportional to the number of entries it is responsible for, because each processor accumulates its assigned entries locally. The algorithm therefore achieves linear speed‑up until the I/O cost of reading the inputs dominates the runtime.
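The per-processor computation described above can be sketched as follows. This is a hypothetical illustration of the scheme, not the authors' implementation; sparse columns and rows are modeled as `{index: value}` dicts, and `hash((i, j)) % p` again stands in for the paper's hash.

```python
from collections import defaultdict

def local_product(cols_A, rows_B, my_id, p):
    """One processor's share of C = A·B via the column-row method.

    cols_A[k] holds the non-zeros of column k of A, rows_B[k] those of
    row k of B, each as an {index: value} dict. Every processor scans
    all outer products but accumulates only the entries the hash
    assigns to it, so no partial sums ever cross processors.
    """
    C_local = defaultdict(float)
    for k in cols_A:                              # one outer product per k
        for i, a in cols_A[k].items():
            for j, b in rows_B[k].items():
                if hash((i, j)) % p == my_id:     # keep only my entries
                    C_local[(i, j)] += a * b
    return C_local
```

Since each output entry is owned by exactly one processor, the union of the local results is the full product C, with no entry computed twice.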
A second major contribution is the integration of frequent-item mining techniques to exploit the empirical observation that the absolute values of entries in many real-world matrix products follow a power-law distribution. By running a Space-Saving or Misra-Gries sketch on each processor’s local sub-matrix, the algorithm can maintain a compact summary of the heaviest entries (the “heavy hitters”). After the local sketches are merged (a cheap reduction that ships only the small top-k lists), the system obtains a tight approximation of the largest weights in the full product matrix. This approach adds only O(k·p) additional memory and only the negligible communication needed to collect the sketches, yet it yields accurate estimates of the most significant entries, which are often the primary interest in applications such as similarity search, graph analytics, or recommendation.
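A minimal sketch of the Misra-Gries summary and a merge step in the spirit of mergeable summaries is shown below. This is an illustrative implementation of the generic technique, not the paper's code, and the merge rule (subtract the k-th largest merged count) is one standard variant.

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-items summary with at most k-1 counters.

    Any item with frequency above n/k is guaranteed to survive, and
    each counter undercounts its item's true frequency by at most n/k.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for y in list(counters):       # decrement all, drop zeros
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

def merge(sketches, k):
    """Combine per-processor summaries into one global summary."""
    total = {}
    for s in sketches:
        for x, c in s.items():
            total[x] = total.get(x, 0) + c
    # Prune back to at most k-1 counters: subtract the k-th largest
    # count from every counter and drop the non-positive ones.
    if len(total) >= k:
        kth = sorted(total.values(), reverse=True)[k - 1]
        total = {x: c - kth for x, c in total.items() if c > kth}
    return total
```

In the paper's setting, each processor would feed the weights of its locally accumulated entries into such a sketch, and the coordinator would merge the p small summaries at the end.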
The authors validate their method on a case study: frequent pair mining in transactional data streams. In this setting, a binary matrix X encodes item presence in transactions; the product X·Xᵀ counts co-occurrences of item pairs. The problem can be expressed as a sparse 0-1 matrix multiplication by the column-row method. Experiments on several real-world datasets show that the proposed scheme scales near-linearly in wall-clock time with the number of processors (2, 4, 8, 16). When 16 processors are used, I/O dominates (>70 % of total time), confirming that communication overhead has been effectively eliminated. Moreover, the heavy-hitter sketches recover the top-100 frequent pairs with an F1 score above 0.98 while using roughly 40 % less memory than baseline streaming counters.
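The reduction from pair mining to the column-row method is easy to see in code: each transaction contributes one outer product of its item set with itself, i.e. a rank-1 update of the co-occurrence counts. The sketch below illustrates this framing only; it is not the authors' streaming implementation.

```python
from collections import defaultdict
from itertools import combinations

def pair_counts(transactions):
    """Co-occurrence counts of item pairs: the above-diagonal entries
    of X·Xᵀ, computed one transaction (one outer product) at a time."""
    C = defaultdict(int)
    for t in transactions:         # each transaction = one column of X
        for a, b in combinations(sorted(set(t)), 2):
            C[(a, b)] += 1         # rank-1 update of the pair counts
    return C
```

Combined with the consistent hash assignment, each processor would keep only the pairs (a, b) that hash to it, and a frequent-items sketch over those counts yields the heaviest pairs.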
Beyond the case study, the paper discusses broader applicability. Any workload that can be framed as a sparse outer‑product—such as triangle counting in large graphs, co‑occurrence matrix construction in text mining, or user‑item interaction matrices in recommender systems—can benefit from the proposed consistent hashing assignment. The method’s simplicity makes it straightforward to implement on existing big‑data platforms (MapReduce, Spark, Flink) by treating each outer product as an independent map task and using the hash to route partial results to reducers that correspond to processors.
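In a MapReduce-style framing, this amounts to a mapper per outer product that keys each partial product by its owning reducer. The sketch below is a hypothetical illustration of that routing (the `mapper`/`reducer` names and the hash are assumptions, not part of the paper):

```python
from collections import defaultdict

def mapper(k, col_a, row_b, p):
    """One map task per outer product k: emit (reducer_id, ((i, j), v)).

    col_a / row_b are {index: value} dicts of the non-zeros of column k
    of A and row k of B; the hash routes each entry to its owner.
    """
    for i, a in col_a.items():
        for j, b in row_b.items():
            yield hash((i, j)) % p, ((i, j), a * b)

def reducer(pairs):
    """Sum the partial products routed to this reducer."""
    out = defaultdict(float)
    for (i, j), v in pairs:
        out[(i, j)] += v
    return out
```

Since every key (i, j) is routed to a single reducer, the reducers' outputs partition the product matrix exactly as the consistent assignment prescribes.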
In summary, the authors present a theoretically grounded, practically efficient parallelization strategy for column‑row matrix multiplication that eliminates inter‑processor communication, balances workload, and leverages power‑law sparsity to approximate the most significant output entries. The combination of deterministic hash‑based assignment and lightweight frequent‑item sketches yields a versatile tool for large‑scale data analytics where the bottlenecks are result storage and communication rather than raw arithmetic.