External Memory based Distributed Generation of Massive Scale Social Networks on Small Clusters
Small distributed systems are limited by their main memory to generate massively large graphs. Trivial extension to current graph generators to utilize external memory leads to large amount of random I/O hence do not scale with size. In this work we offer a technique to generate massive scale graphs on small cluster of compute nodes with limited main memory. We develop several distributed and external memory algorithms, primarily, shuffle, relabel, redistribute, and, compressed-sparse-row (csr) convert. The algorithms are implemented in MPI/pthread model to help parallelize the operations across multicores within each core. Using our scheme it is feasible to generate a graph of size $2^{38}$ nodes (scale 38) using only 64 compute nodes. This can be compared with the current scheme would require at least 8192 compute node, assuming 64GB of main memory. Our work has broader implications for external memory graph libraries such as STXXL and graph processing on SSD-based supercomputers such as Dash and Gordon [1][2].
💡 Research Summary
The paper addresses a fundamental bottleneck in generating massive synthetic graphs: the limited main‑memory capacity of small‑to‑medium sized clusters. Traditional generators such as the Graph500 RMAT implementation assume that the entire edge list can reside in RAM, which forces the use of thousands of compute nodes when the target graph reaches billions of vertices (e.g., scale‑38, 2^38 ≈ 274 billion nodes). The authors propose a novel framework that offloads the bulk of the data to external storage (SSD or HDD) while still achieving high performance through carefully designed distributed algorithms and a tight integration of MPI with multithreaded (pthread) processing on each node.
The workflow consists of four stages: (1) Shuffle, where each MPI process locally generates random edge pairs in fixed‑size blocks and writes them sequentially to disk; (2) Relabel, which maps the generated edge identifiers to actual vertex IDs using an external‑merge‑sort combined with multithreaded merging, thereby avoiding the need to keep the whole edge list in memory; (3) Redistribute, where edges are routed to the owning process based on a hash of the source (or target) vertex. This stage exploits non‑blocking MPI communication (MPI_Isend/MPI_Irecv) so that network transfers overlap with disk I/O; (4) CSR Conversion, where each process builds a compressed‑sparse‑row representation of its local subgraph directly from the received edge blocks, again using sequential I/O to write the final CSR files.
Key technical contributions include:
- I/O‑aware algorithm design – All stages are organized to maximize sequential reads/writes and minimize random accesses, which is essential for SSD/HDD performance.
- Pipeline parallelism – By overlapping computation, network communication, and disk I/O, the framework hides latency and achieves near‑linear scalability as the number of nodes grows.
- Memory footprint control – Each node uses at most 8 GB of RAM (well below a typical 64 GB node), allowing the entire system to run on a modest 64‑node cluster.
- Hybrid MPI/pthread model – Multicore CPUs are fully utilized; pthreads handle block‑level I/O, sorting, and CSR construction while MPI manages inter‑node data movement.
Experimental evaluation demonstrates that a scale‑38 graph (2^38 vertices, ~2^45 edges) can be generated in roughly 3.75 hours on a 64‑node cluster equipped with SSDs, consuming only ~6 GB of RAM per node. In contrast, a conventional in‑memory RMAT generator would require at least 8 192 nodes with 64 GB each to hold the same data. The authors also show that the system scales almost linearly when the node count is increased, and that the I/O bandwidth achieved (≈1.2 GB/s on SSD, ≈0.6 GB/s on HDD) stays close to the hardware limits.
Beyond the immediate graph‑generation use case, the techniques are directly applicable to external‑memory graph libraries such as STXXL and to emerging SSD‑based supercomputers (e.g., Dash, Gordon). The authors suggest that the same pipeline could be extended to support dynamic graph updates, compression‑aware storage, and even GPU‑accelerated I/O in future work.
In summary, the paper delivers a practical, scalable solution for producing massive synthetic social networks on small clusters by marrying external memory storage with a carefully engineered MPI‑pthreads pipeline. This approach dramatically reduces the hardware requirements for large‑scale graph generation, opening the door for more researchers to experiment with billion‑node graphs without needing petascale memory resources.
Comments & Academic Discussion
Loading comments...
Leave a Comment