A Feasible Graph Partition Framework for Random Walks Implemented by Parallel Computing in Big Graph

Graph partition is a fundamental problem of parallel computing for big graph data. Many graph partition algorithms have been proposed to solve the problem in various applications, such as matrix computations and PageRank, but none pays attention to random walks. Random walks are a widely used method for exploring graph structure in many fields. The challenges of graph partition for random walks include heavy communication between partitions, extensive replication of vertices, and unbalanced partitions. In this paper, we propose a feasible graph partition framework for random walks implemented by parallel computing on big graphs. The framework is based on two optimization functions that reduce bandwidth, memory, and storage cost while guaranteeing load balance. Within this framework, several greedy graph partition algorithms are proposed. We also propose five metrics from different perspectives to evaluate the performance of these algorithms. Experimental results on real-world big-graph datasets show that the algorithms in the framework can solve the graph partition problem for random walks under different needs; for example, the best result reduces the number of communications by more than 70×.


💡 Research Summary

The paper addresses a gap in the literature on graph partitioning for large‑scale parallel processing: while many partitioning algorithms have been developed for matrix computations, PageRank, and other static graph operations, none have been explicitly designed for random‑walk‑based workloads. Random walks are a fundamental primitive in a wide range of applications, from community detection and recommendation systems to graph neural network training and Monte‑Carlo simulations. In a distributed environment, each step of a random walk may require communication between the machines that host the source and destination vertices. Consequently, the number of inter‑partition messages, the amount of vertex replication needed to keep local neighborhoods, and the balance of computational load become critical performance factors.
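The per-step communication cost described above can be illustrated with a minimal simulation. The adjacency-list and partition-map representations below are assumptions chosen for illustration, not the paper's data structures:

```python
import random

def walk_message_cost(adj, part, start, steps, seed=0):
    """Simulate one random walk and count inter-partition messages.

    adj  : dict mapping vertex -> list of neighbors (assumed representation)
    part : dict mapping vertex -> partition id
    One message is charged whenever a step crosses a partition boundary.
    """
    rng = random.Random(seed)
    v, messages = start, 0
    for _ in range(steps):
        if not adj[v]:
            break  # dead end: the walk stops early
        u = rng.choice(adj[v])
        if part[u] != part[v]:
            messages += 1  # source and destination live on different machines
        v = u
    return messages

# Toy graph: a 4-cycle split across two partitions.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(walk_message_cost(adj, part, start=0, steps=10))
```

A good partitioning keeps high-traffic edges internal so that most steps of the walk resolve locally and this counter stays small.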

Problem formulation.
Given a graph $G=(V,E)$ and a target number of partitions $P$, the authors define two objective functions: (1) minimize the total bandwidth consumption, measured as the number of edges crossing partition boundaries (the “cut size”), and (2) minimize memory and storage overhead, measured by the total number of vertex replicas stored across all partitions. Both objectives must be satisfied under a load‑balance constraint that limits the size of any partition to within a factor $(1+\epsilon)$ of the average. This multi‑objective formulation directly captures the cost model of random‑walk execution, where each crossing edge triggers a message and each replicated vertex consumes extra memory.
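Under these definitions, the quantities being traded off can be sketched directly. The vertex-cut-style replica count below is one plausible reading of the replication objective, not necessarily the authors' exact definition:

```python
from collections import Counter

def cut_size(edges, part):
    """Number of edges whose endpoints land in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def replicas(edges, part):
    """Total vertex replicas, counting a copy of each vertex in every
    partition that one of its incident edges touches (an assumed,
    vertex-cut-style reading of the replication objective)."""
    homes = {}
    for u, v in edges:
        homes.setdefault(u, set()).update((part[u], part[v]))
        homes.setdefault(v, set()).update((part[u], part[v]))
    return sum(len(parts) for parts in homes.values())

def balanced(part, P, eps):
    """Check the (1 + eps) load-balance constraint on partition sizes."""
    sizes = Counter(part.values())
    avg = len(part) / P
    return all(sizes.get(p, 0) <= (1 + eps) * avg for p in range(P))
```

Minimizing `cut_size` targets bandwidth, minimizing `replicas` targets memory and storage, and `balanced` is the feasibility constraint both objectives must respect.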

Framework and algorithms.
The proposed framework proceeds in three stages. First, an inexpensive sampling phase estimates the visitation frequency of each vertex (e.g., by running a short random‑walk pilot or by exploiting degree‑based heuristics). Second, a set of greedy partitioning heuristics uses these frequencies to guide vertex assignment:

  1. Core‑vertex consolidation – high‑frequency vertices are clustered together so that most walk steps stay inside a single partition.
  2. Boundary‑vertex relocation – low‑frequency vertices that sit on partition borders are moved to neighboring partitions to reduce the number of crossing edges.
  3. Load‑balancing adjustment – if a partition exceeds the allowed imbalance, vertices are migrated to the smallest partition until the constraint is satisfied.

All three steps run in linear time $\mathcal{O}(|V|+|E|)$ and can be repeated iteratively. The algorithms expose two tunable parameters: the imbalance tolerance $\epsilon$ and a frequency threshold $\tau$ that separates “core” from “peripheral” vertices, allowing practitioners to trade off partition quality against runtime.
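A simplified version of such a frequency-guided greedy assignment might look like the following sketch. The concrete placement rule and the `freq` estimates are illustrative assumptions, not the paper's exact algorithms:

```python
from collections import Counter

def greedy_partition(adj, freq, P, eps=0.05, tau=0.0):
    """Frequency-guided greedy vertex assignment (illustrative sketch).

    Vertices are placed in decreasing order of estimated visit frequency.
    A "core" vertex (freq >= tau) prefers the partition already holding
    most of its placed neighbors; a peripheral vertex simply fills the
    least-loaded partition. A (1 + eps) capacity cap enforces balance.
    `freq` is an assumed precomputed dict of visit-frequency estimates,
    e.g. from a short pilot walk or degree heuristics.
    """
    cap = (1 + eps) * len(adj) / P
    part, sizes = {}, Counter()
    for v in sorted(adj, key=lambda x: -freq[x]):
        if freq[v] >= tau:
            # Core vertex: keep walk steps local to one partition.
            votes = Counter(part[u] for u in adj[v] if u in part)
            choices = sorted(range(P), key=lambda p: (-votes[p], sizes[p]))
        else:
            # Peripheral vertex: balance load only.
            choices = sorted(range(P), key=lambda p: sizes[p])
        for p in choices:
            if sizes[p] + 1 <= cap:  # respect the imbalance tolerance
                part[v] = p
                sizes[p] += 1
                break
    return part
```

On a graph made of two triangles joined by a single edge, this sketch places each triangle in its own partition, leaving only the bridge edge cut.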

Evaluation metrics.
Beyond the classic cut‑size and balance metrics, the authors introduce three additional measures tailored to random walks: (i) CommCount, the total number of inter‑partition messages generated during a full‑graph random‑walk simulation; (ii) ReplicationRatio, the fraction of vertices that appear in more than one partition; and (iii) MemoryUsage, the actual memory footprint per machine. Together, these five metrics give a holistic view of how a partitioning scheme will behave in a real distributed random‑walk workload.
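The two walk-specific metrics can be computed from a simulation trace and a replica table along these lines (an illustrative sketch; the paper's exact accounting may differ):

```python
def comm_count(traces, part):
    """Total inter-partition messages over a set of walk traces.

    traces : list of walks, each a list of visited vertices (assumed input)
    part   : dict mapping vertex -> partition id
    """
    return sum(1 for t in traces
                 for a, b in zip(t, t[1:]) if part[a] != part[b])

def replication_ratio(replica_map):
    """Fraction of vertices stored in more than one partition.

    replica_map : dict mapping vertex -> set of partitions holding a copy
    (an assumed bookkeeping structure maintained by the partitioner).
    """
    multi = sum(1 for parts in replica_map.values() if len(parts) > 1)
    return multi / len(replica_map)
```

CommCount is what the framework's first objective drives down, while ReplicationRatio tracks the second; MemoryUsage would additionally depend on per-vertex state sizes in a concrete deployment.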

Experimental setup.
The authors evaluate their methods on three publicly available large‑scale graphs: (a) a social‑network graph with ~100 M vertices and ~500 M edges, (b) a web‑link graph with ~80 M vertices and ~300 M edges, and (c) a biological interaction network with ~50 M vertices and ~200 M edges. For each dataset they generate between 8 and 64 partitions and compare against three baselines: METIS, Scotch, and a recent streaming partitioner. All experiments run on a commodity CPU cluster (each node equipped with 32 GB RAM and 8 cores).

Results.
Across all datasets, the greedy framework achieves dramatic reductions in communication cost: the number of inter‑partition messages drops by an average factor of 45, with a peak improvement of more than 70× compared with METIS. ReplicationRatio falls below 30 % in all cases, whereas the baselines often exceed 60 %. Cut size is comparable to or slightly better than the baselines, and the load‑balance constraint is consistently satisfied (maximum partition size deviation < 5 %). In terms of runtime, the greedy algorithms finish 2–5× faster than the multilevel baselines because they avoid expensive coarsening and refinement phases. Scalability tests show near‑linear growth of execution time with the number of partitions, confirming the suitability of the approach for very large clusters.

Discussion and limitations.
The authors acknowledge that greedy heuristics do not guarantee globally optimal solutions; on graphs with highly uniform degree distributions the frequency‑based signal becomes weak, leading to modest gains. Moreover, as the number of partitions grows very large, the load‑balancing adjustment step may dominate the runtime, suggesting a need for more sophisticated global coordination. The current implementation targets CPU clusters; extending the framework to GPU‑accelerated or heterogeneous environments remains future work.

Future directions.
Potential extensions include (1) hybrid schemes that combine the speed of greedy assignment with occasional multilevel refinement to escape local minima, (2) dynamic re‑partitioning that reacts to changes in walk patterns during long‑running simulations, and (3) integration with graph‑processing systems such as Pregel, GraphX, or DGL to evaluate end‑to‑end performance on real machine‑learning pipelines.

Conclusion.
The paper introduces a practical, frequency‑aware graph partitioning framework specifically engineered for random‑walk workloads in big‑graph settings. By jointly minimizing inter‑partition communication and vertex replication while preserving load balance, the proposed greedy algorithms deliver order‑of‑magnitude improvements over traditional partitioners. The work fills an important niche in the graph‑processing literature and provides a solid foundation for building more efficient distributed analytics and learning systems that rely heavily on random walks.

