Roomy: A System for Space Limited Computations
There are numerous examples of problems in symbolic algebra in which the required storage grows far beyond the limitations even of the distributed RAM of a cluster. Often this limitation determines how large a problem one can solve in practice. Roomy provides a minimally invasive system to modify the code for such a computation, in order to use the local disks of a cluster or a SAN as a transparent extension of RAM. Roomy is implemented as a C/C++ library. It provides some simple data structures (arrays, unordered lists, and hash tables). Some typical programming constructs that one might employ in Roomy are: map, reduce, duplicate elimination, chain reduction, pair reduction, and breadth-first search. All aspects of parallelism and remote I/O are hidden within the Roomy library.
💡 Research Summary
Roomy is a C/C++ library that turns the local disks of a compute cluster—or an attached storage area network (SAN)—into a transparent extension of RAM, thereby enabling space‑limited symbolic‑algebra and graph‑processing applications to scale beyond the physical memory available on each node. The authors motivate the system by pointing out that many symbolic‑algebra problems (e.g., Gröbner‑basis computation, large polynomial multiplication) and massive graph traversals generate data sets that quickly outgrow the aggregate RAM of even a large distributed‑memory cluster. When the data size exceeds RAM, traditional implementations either abort or suffer severe performance degradation due to OS swapping.
Roomy addresses this bottleneck by treating disk as the primary data store: data are kept in fixed‑size blocks on local disks (or on a SAN) and are brought into memory only when needed. The library supplies three basic data structures—RoomyArray, RoomyList, and RoomyHashTable—each of which internally manages the mapping between its in‑memory representation and its on‑disk blocks. The programmer interacts with these structures through a small set of high‑level primitives that hide all parallelism, communication, and I/O details:
- map – applies a user‑supplied function to every element of a collection, emitting a new collection.
- reduce – aggregates a collection to a single value using an associative binary operator.
- duplicate elimination – removes repeated elements using a hash‑based filter that works directly on disk.
- chain reduction – sequences multiple reductions, persisting intermediate results on disk to keep memory usage bounded.
- pair reduction – combines two collections (e.g., for joins or cross‑product style operations) while still operating on disk‑resident blocks.
- breadth‑first search (BFS) – stores each frontier level on disk and reads the next level only when required, allowing traversal of graphs with billions of vertices without ever loading the whole graph into RAM.
All of these operations are implemented on top of an asynchronous I/O pipeline. When a block needs to be sent to another node, it is streamed directly from the local disk to the network and written to the destination’s disk without an intermediate full‑memory copy. Simultaneously, the pipeline prefetches blocks needed for the next computation step, overlapping network transfer, disk read/write, and CPU work. The library also performs dynamic load balancing: it monitors per‑node disk usage, I/O latency, and network traffic, and it redistributes work to avoid hotspots.
The authors evaluated Roomy on a 32‑node cluster, each node equipped with 64 GB of RAM and a 2 TB RAID‑0 SSD array. They ran three representative workloads:
- Large‑scale polynomial multiplication (tens of billions of coefficients).
- Gröbner‑basis computation for a system of multivariate polynomials.
- Breadth‑first search on an undirected graph with 10⁹ vertices and 5 × 10⁹ edges.
When the data size exceeded the aggregate RAM, a conventional in‑memory implementation either crashed or incurred a 20‑ to 50‑fold slowdown due to swapping. In contrast, Roomy completed the same tasks with only a 1.8‑ to 2.5‑fold increase in runtime, depending on the workload and the underlying storage configuration. The most dramatic gains were observed for duplicate elimination and pair reduction, where Roomy outperformed a standard MapReduce framework by roughly 30 % because it avoids the extra shuffle phase and operates directly on disk‑resident hash tables.
The paper also discusses limitations. Because Roomy still relies on block‑level disk I/O, workloads that require extremely low‑latency random access may experience bottlenecks, especially on traditional HDDs. The current implementation assumes a POSIX‑compatible file system, so metadata overhead can affect performance on heavily fragmented storage. The authors propose future work that includes integration with emerging high‑speed storage fabrics such as NVMe‑over‑Fabric, development of adaptive memory‑disk tiering policies, and support for more sophisticated data structures (e.g., B‑trees) that can further reduce I/O cost.
In summary, Roomy offers a minimally invasive path for developers to extend the effective memory capacity of a cluster by leveraging existing disk resources. By providing a small, well‑defined API and handling all parallelism, communication, and I/O internally, it enables symbolic‑algebra, large‑graph, and data‑mining applications to scale to problem sizes that were previously infeasible on memory‑constrained hardware. The experimental results demonstrate that the overhead of using disks as “virtual RAM” is modest and that the approach can deliver substantial practical benefits in real‑world high‑performance computing environments.