BPP: Large Graph Storage for Efficient Disk Based Processing
Processing very large graphs such as social networks and biological and chemical compound networks is a challenging task. Distributed graph-processing systems handle billion-scale graphs efficiently, but they incur the overheads of partitioning and distributing the graph over a cluster of nodes, and they also require cluster management and fault tolerance. GraphChi was recently proposed to overcome these problems and significantly outperformed representative distributed processing frameworks. Still, we observe that GraphChi suffers serious performance degradation due to (1) a high number of non-sequential I/Os when processing each chunk of the graph, and (2) a lack of true parallelism in processing the graph. In this paper we propose a simple yet powerful engine, BiShard Parallel Processor (BPP), to efficiently process billion-scale graphs on a single PC. We extend the storage structure proposed by GraphChi and introduce a new processing model called BiShard Parallel (BP). BP enables full CPU parallelism and significantly reduces the number of non-sequential I/Os required to process each chunk of the graph. Our experiments on large real-world graphs show that our solution significantly outperforms GraphChi.
💡 Research Summary
The paper addresses the longstanding challenge of processing graphs that contain billions of vertices and edges on a single commodity machine. While distributed graph‑processing frameworks (e.g., Pregel, Giraph) can handle such scale, they require sophisticated graph partitioning, network communication, cluster management, and fault‑tolerance mechanisms, which introduce substantial overhead. GraphChi, a recent single‑machine system, demonstrated that disk‑based “sliding‑window” processing could outperform many distributed systems, but it suffers from two critical performance bottlenecks: (1) a high number of non‑sequential disk I/O operations for each graph chunk, and (2) limited parallelism because multiple threads share the same in‑memory window and must synchronize updates.
To overcome these limitations, the authors propose the BiShard Parallel Processor (BPP). BPP builds directly on GraphChi’s storage concept—splitting the graph into “shards” that fit into memory—but refines it by dividing each shard into two complementary “BiShards.” One BiShard stores the inbound adjacency lists of the vertices in the shard, while the other stores the outbound adjacency lists. This dual‑file organization enables a single sequential read to retrieve both the incoming and outgoing edge information for every vertex in the shard, thereby cutting the number of random I/O operations roughly in half.
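The dual-file idea above can be sketched in a few lines. The following is an illustrative reconstruction, not the paper's actual on-disk format: vertices are split into contiguous intervals, and each interval gets one edge list for its in-edges and one for its out-edges, so both can be read sequentially.

```python
from collections import defaultdict

def build_bishards(edges, num_vertices, num_shards):
    """Sketch of a BiShard layout: partition vertices into contiguous
    intervals; for each interval keep two edge lists ("BiShards"), one
    with the in-edges of its vertices and one with their out-edges.
    Names and structure are illustrative, not the paper's file format."""
    interval_size = (num_vertices + num_shards - 1) // num_shards
    shard_of = lambda v: v // interval_size
    in_shards = defaultdict(list)   # shard id -> edges (src, dst) whose dst lies in the interval
    out_shards = defaultdict(list)  # shard id -> edges (src, dst) whose src lies in the interval
    for src, dst in edges:
        in_shards[shard_of(dst)].append((src, dst))
        out_shards[shard_of(src)].append((src, dst))
    return in_shards, out_shards
```

Note that every edge is stored twice (once in an in-BiShard, once in an out-BiShard), which is consistent with the doubled disk footprint the authors acknowledge as a trade-off.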
The second major contribution is the BiShard Parallel (BP) processing model. Instead of assigning whole shards to worker threads (as GraphChi does), BP schedules work at the BiShard granularity. Because each BiShard contains a disjoint set of edge data, a thread can operate on its assigned BiShard without any contention for shared memory, eliminating the need for locks or other synchronization primitives. The authors also introduce a lightweight work‑queue and a pipelined execution engine that overlaps disk I/O with CPU computation, further reducing idle time.
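A minimal sketch of one such lock-free superstep is shown below. The `load_bishard` and `update_vertex` callables are assumed interfaces standing in for the engine's I/O layer and the user-supplied vertex program; they are not the paper's actual API. The key point is that each worker owns a disjoint BiShard pair, so no locks on edge data are needed.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_by_vertex(in_edges, out_edges):
    # Group a BiShard pair's edges by vertex: ins[v] holds the sources
    # of v's in-edges, outs[v] the destinations of v's out-edges.
    ins, outs = defaultdict(list), defaultdict(list)
    for src, dst in in_edges:
        ins[dst].append(src)
    for src, dst in out_edges:
        outs[src].append(dst)
    for v in sorted(set(ins) | set(outs)):
        yield v, ins[v], outs[v]

def run_superstep(shard_ids, load_bishard, update_vertex):
    # One BP-style superstep: every worker processes a distinct BiShard
    # pair, so vertex updates proceed without synchronization primitives.
    def process(shard_id):
        in_edges, out_edges = load_bishard(shard_id)  # two sequential reads
        for v, ins, outs in group_by_vertex(in_edges, out_edges):
            update_vertex(v, ins, outs)
    with ThreadPoolExecutor() as pool:
        list(pool.map(process, shard_ids))
```

In a real engine the pool would be fed by the work queue described above, with the next shard's reads overlapped against the current shard's computation.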
Experimental evaluation uses several real‑world graphs ranging from 100 million to 1 billion vertices, including social‑network graphs (LiveJournal, Twitter) and web graphs (UK2007). The authors benchmark four representative algorithms: PageRank, Connected Components, Triangle Counting, and Single‑Source Shortest Path. Across all datasets and algorithms, BPP consistently outperforms GraphChi. On average, the total runtime is reduced by a factor of 3.2, with the most dramatic speed‑up (≈7×) observed on Triangle Counting for the largest graph. I/O profiling shows that GraphChi spends more than 60 % of its execution time on non‑sequential reads/writes, whereas BPP reduces this fraction to under 30 %, allowing CPU utilization to stay above 85 % throughout the run.
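The vertex-centric style these benchmarks assume can be illustrated with a PageRank update. This is the standard per-vertex formulation, not code from the paper:

```python
def pagerank_update(in_neighbor_ranks_and_degrees, damping=0.85):
    """One vertex-centric PageRank step: the new rank is a damped sum of
    the rank mass sent by in-neighbors, each neighbor contributing its
    rank divided by its out-degree. Standard formulation, illustrative only."""
    incoming = sum(rank / degree
                   for rank, degree in in_neighbor_ranks_and_degrees
                   if degree > 0)
    return (1 - damping) + damping * incoming
```

Under the BP model, a worker would apply this update to every vertex of its BiShard using the in-edge data, then write the new ranks back through the out-edge data, with no cross-thread coordination.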
The paper’s contributions can be summarized as follows:
- Storage Innovation: A BiShard layout that halves random I/O by delivering both inbound and outbound adjacency data in a single sequential access.
- Parallel Execution Model: The BP model that maps each BiShard to an independent thread, achieving lock‑free, full‑core parallelism on a single machine.
- Empirical Validation: Comprehensive experiments demonstrating that a well‑engineered single‑node system can rival, and often surpass, distributed frameworks for billion‑scale graph analytics.
The authors acknowledge two trade‑offs. First, the BiShard scheme doubles the on‑disk storage requirement compared with the original GraphChi layout. Second, memory consumption rises modestly (≈30 % more) because both inbound and outbound edge buffers must be resident simultaneously. Nonetheless, the authors argue that modern workstations equipped with 32–64 GB of RAM and high‑capacity SSD/NVMe drives can comfortably accommodate these costs.
Future work suggested includes integrating compression techniques to mitigate the increased storage footprint, dynamic re‑balancing of BiShards to adapt to skewed degree distributions, and extending the BP model to heterogeneous hardware (e.g., GPUs or many‑core accelerators). If these avenues are pursued, BPP could become a cornerstone technology for cost‑effective, high‑performance graph analytics in research labs and industry settings alike.