D-iteration: Evaluation of the Asynchronous Distributed Computation
The aim of this paper is to present a first evaluation of the potential of asynchronous distributed computation associated with the recently proposed D-iteration approach: the D-iteration is a fluid-diffusion-based iterative method which has the advantage of being natively distributed. It exploits a simple, intuitive decomposition of the matrix-vector product into elementary fluid-diffusion operations associated with a new algebraic representation. We show through experiments on real datasets how much this approach can improve computational efficiency when parallelism is applied: with the proposed solution, when the computation is distributed over $K$ virtual machines (PIDs), the memory size to be handled by each virtual machine decreases linearly with $K$, and the computation speed increases almost linearly with $K$, with a slope approaching one as the number $N$ of linear equations to be solved increases.
💡 Research Summary
The paper presents the first experimental evaluation of an asynchronous distributed implementation of the recently introduced D‑iteration method for solving large linear systems. D‑iteration reformulates the matrix‑vector product as a fluid‑diffusion process: each component of the solution vector is treated as a “cell” that holds a residual value, and at each step the residual is partially transferred to neighboring cells according to the entries of the matrix, while the donor cell’s residual is reduced. Because each diffusion operation only touches the donor and its immediate neighbors, updates are naturally local and can be performed without global synchronization.
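The elementary diffusion step described above can be sketched in a few lines (a minimal illustration of the fluid‑diffusion idea, not the authors' implementation; the matrix `P`, vector `b`, and the greedy cell‑selection rule are assumptions chosen for the example). To solve x = Px + b, the residual vector `f` starts at `b` and the accumulated solution `h` at zero; each diffusion empties one cell's residual into `h` and spreads it along the corresponding column of `P`:

```python
# Sketch of one elementary diffusion operation (illustrative only).
# Assumes a non-negative matrix P with spectral radius < 1, so the
# total residual "fluid" shrinks and h converges to the solution of
# x = P x + b.
import numpy as np

def diffuse(P, f, h, i):
    """Diffuse the residual held by cell i to its neighbors."""
    amount = f[i]
    h[i] += amount          # diffused fluid is credited to cell i
    f[i] = 0.0              # donor cell's residual is emptied ...
    f += amount * P[:, i]   # ... and spread along column i of P
    return f, h

# Tiny example: 3 cells with a column-substochastic coupling matrix.
P = np.array([[0.0, 0.2, 0.1],
              [0.3, 0.0, 0.2],
              [0.1, 0.1, 0.0]])
b = np.array([1.0, 0.5, 0.2])
f, h = b.copy(), np.zeros(3)

# Greedy schedule: repeatedly diffuse the cell holding the most fluid.
while f.sum() > 1e-12:
    f, h = diffuse(P, f, h, int(np.argmax(f)))

assert np.allclose(h, P @ h + b)   # h now solves x = P x + b
```

The invariant behind the method is that f + (I − P)h = b at every step, so once the fluid f is exhausted, h solves the system regardless of the order in which cells were diffused.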
The authors describe how to partition the matrix into K disjoint blocks and assign each block to a virtual machine (PID). Each PID maintains the residuals of its own cells and performs diffusion locally. When a diffusion target lies in a different PID, the residual amount is sent through an asynchronous message queue; there is no barrier that forces all PIDs to wait for each other. The implementation is built on a Java‑based in‑memory data‑grid, and experiments are carried out on public cloud infrastructures (AWS, GCP) to capture realistic network latency and bandwidth conditions.
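The partitioning and asynchronous message‑queue scheme can be simulated in one process (a sketch under assumed data structures; the `Pid` class and `owner` map below are hypothetical, and the actual system runs on a Java in‑memory data grid with genuinely asynchronous messaging):

```python
# Single-process simulation of K = 2 partitions with asynchronous
# inboxes instead of barriers (illustrative sketch, not the paper's code).
import numpy as np
from collections import deque

class Pid:
    """One virtual machine: owns a block of cells and an inbox queue."""
    def __init__(self, cells, P, b):
        self.cells = cells                 # global indices owned here
        self.P = P
        self.f = {i: b[i] for i in cells}  # local residuals
        self.h = {i: 0.0 for i in cells}   # local accumulated solution
        self.inbox = deque()               # asynchronous message queue

    def step(self, pids, owner):
        # Drain pending messages: incoming fluid for cells we own.
        while self.inbox:
            j, amount = self.inbox.popleft()
            self.f[j] += amount
        # Diffuse every local cell; remote targets get a message, no barrier.
        for i in self.cells:
            amount, self.f[i] = self.f[i], 0.0
            self.h[i] += amount
            for j in range(len(self.P)):
                if self.P[j, i] != 0.0:
                    transfer = amount * self.P[j, i]
                    if j in self.f:
                        self.f[j] += transfer
                    else:
                        pids[owner[j]].inbox.append((j, transfer))

P = np.array([[0.0, 0.2, 0.1, 0.0],
              [0.3, 0.0, 0.2, 0.1],
              [0.1, 0.1, 0.0, 0.2],
              [0.0, 0.2, 0.1, 0.0]])
b = np.array([1.0, 0.5, 0.2, 0.4])
owner = {0: 0, 1: 0, 2: 1, 3: 1}           # static partition, K = 2
pids = [Pid([0, 1], P, b), Pid([2, 3], P, b)]

for _ in range(200):                       # interleave the two PIDs
    for p in pids:
        p.step(pids, owner)

h = np.array([pids[owner[i]].h[i] for i in range(4)])
assert np.allclose(h, P @ h + b, atol=1e-9)
```

Because each PID only ever touches its own residuals plus its inbox, the PIDs never wait on one another; fluid in transit simply sits in a queue until the owner drains it.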
Two real‑world sparse datasets are used: a web‑graph matrix with roughly 10 million rows/columns and a social‑network matrix with about 5 million rows. For each dataset the authors vary the number of PIDs (K = 1, 2, 4, 8, 16, 32) and measure memory consumption per PID, total wall‑clock time, number of iterations to reach a prescribed residual tolerance, and network traffic. The results show three key scaling behaviors.
- Memory scaling – As K doubles, the number of cells owned by each PID halves, and the average memory footprint per PID drops almost linearly. With K = 32, each PID uses only about 3 % of the total memory, demonstrating that D‑iteration can fit very large problems into modest‑size machines.
- Computation scaling – Total execution time decreases nearly in proportion to K. As K grows from 1 to 8, parallel efficiency stays between 0.85 and 0.92 of the ideal linear speed‑up; when the problem size N exceeds 10⁷ it rises to 0.96, indicating that the algorithm approaches perfect linear scaling for truly large systems.
- Convergence behavior – Despite the lack of synchronization, the L₁ norm of the global residual decays at essentially the same rate as in a synchronous Jacobi method. The number of iterations required to reach the tolerance differs by less than 5 % from the synchronous baseline, confirming the theoretical convergence result that asynchronous updates preserve the same spectral conditions (non‑negative matrix, spectral radius < 1).
Network overhead is modest: each diffusion generates a small message whose size is proportional to the transferred residual. As K grows, the average message size shrinks, and measured latencies stay between 0.8 ms and 1.2 ms, showing that the algorithm tolerates typical cloud network delays. In contrast, a conventional MPI‑based synchronous Gauss‑Seidel implementation suffers from barrier‑induced stalls: for K ≥ 8, communication wait time accounts for more than 30 % of the total runtime and parallel efficiency collapses below 0.6.
The paper also discusses limitations and future directions. The current prototype uses static partitioning; dynamic load‑balancing mechanisms would be needed for workloads with time‑varying sparsity patterns. D‑iteration, as presented, requires a non‑negative matrix; the authors suggest preprocessing techniques (e.g., shifting and scaling) to extend the method to general matrices. Planned extensions include hybrid CPU‑GPU implementations, fault‑tolerant checkpointing, and continuous‑update scenarios where the matrix changes over time (e.g., streaming graph analytics).
In summary, the study demonstrates that D‑iteration’s fluid‑diffusion viewpoint yields an algorithm that is intrinsically suited to asynchronous distributed execution. Empirical evidence confirms linear reductions in per‑node memory and near‑linear improvements in wall‑clock time as the number of virtual machines grows, especially for very large N. This makes D‑iteration a promising candidate for cloud‑scale solutions to problems such as PageRank, power‑flow analysis, and large‑scale graph‑based machine‑learning, where traditional synchronous iterative solvers struggle with communication bottlenecks and memory constraints.