GPUs as Storage System Accelerators
Massively multicore processors, such as Graphics Processing Units (GPUs), offer, at comparable cost, an order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, like any order-of-magnitude drop in the cost per unit of performance for a class of system components, creates the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs’ computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of GPU offloading on competing applications’ performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications.
💡 Research Summary
The paper “GPUs as Storage System Accelerators” investigates whether the massive parallelism of modern graphics processing units (GPUs) can be harnessed to improve the performance, reliability, and security of distributed storage systems. The authors begin by noting that GPUs deliver an order‑of‑magnitude higher peak FLOP rate than CPUs at comparable cost, a shift that invites a re‑examination of system design trade‑offs. They focus on hashing‑based primitives, which are ubiquitous in storage: content‑addressable storage (CAS) relies on hash identifiers for deduplication and versioning, while integrity‑preserving systems recompute hashes on every read or write to detect corruption. Because hash functions such as SHA‑1, SHA‑256, or MD5 are embarrassingly parallel at the block level, they are natural candidates for GPU off‑loading.
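The block‑level parallelism can be sketched in a few lines: each fixed‑size chunk is hashed independently, with no shared state between chunks. The sketch below uses Python's `hashlib` and a thread pool merely to stand in for the GPU's parallel lanes; the function name and 4 KB chunk size follow the prototype described later, but the code itself is an illustrative assumption, not the paper's implementation.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4096  # 4 KB, the chunk size used by the prototype


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Hash each fixed-size chunk independently. Chunks share no state,
    which is exactly what makes the workload embarrassingly parallel
    and a natural fit for GPU off-loading."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:  # stand-in for GPU thread groups
        return list(pool.map(lambda c: hashlib.sha1(c).digest(), chunks))
```

On a real GPU the same structure holds: one digest per chunk, computed by an independent group of threads.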
The prototype described in the paper consists of a CUDA‑based kernel that processes data in fixed‑size chunks (typically 4 KB). Input data are split into chunks, each chunk is hashed by an independent group of GPU threads, and multiple CUDA streams keep transfers and kernel launches in flight concurrently, allowing thousands of hash calculations to proceed in parallel. To reduce PCIe transfer overhead, the authors employ pinned (page‑locked) host memory and asynchronous memcpy operations, overlapping data movement with kernel execution. Moreover, they perform intermediate aggregation on the GPU—building Merkle‑tree nodes before copying results back—to further cut the volume of data transferred to the CPU.
Two deployment scenarios are evaluated. In the first, the system operates as a content‑addressable storage layer that detects similarity between successive versions of a file. When a new version arrives, the GPU hashes each chunk, compares the resulting digests with those stored for the previous version, and only transmits or stores the changed chunks. This enables online deduplication and rapid version diffing. In the second scenario, the prototype functions as a traditional integrity‑checking system: every read or write triggers a hash of the affected block, which is then compared against a stored reference to verify that no bit‑rot or malicious tampering has occurred.
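The version‑diffing logic in the first scenario amounts to comparing per‑chunk digests of the old and new versions and keeping only the chunks whose digests differ. The helper below is a hypothetical sketch of that comparison (the paper does not give this interface); it assumes fixed 4 KB chunks and SHA‑1 digests as above.

```python
import hashlib


def changed_chunks(old: bytes, new: bytes, chunk_size: int = 4096) -> list[int]:
    """Return the indices of chunks in `new` whose digest differs from the
    corresponding chunk of `old` (or that have no counterpart). Only these
    chunks need to be transmitted or stored."""
    def digests(data: bytes) -> list[bytes]:
        return [hashlib.sha1(data[i:i + chunk_size]).digest()
                for i in range(0, len(data), chunk_size)]

    old_d, new_d = digests(old), digests(new)
    return [i for i, d in enumerate(new_d)
            if i >= len(old_d) or d != old_d[i]]
```

The same digest comparison, run against a stored reference rather than a previous version, gives the integrity‑checking mode of the second scenario.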
Performance experiments were conducted on a server equipped with a single NVIDIA Tesla GPU and on a four‑GPU cluster. Workloads included large, sequential files (several gigabytes) and a mix of small files (hundreds of kilobytes). The GPU‑accelerated CAS achieved 4–6× higher throughput than a CPU‑only baseline for large files, and up to 7× for the small‑file mix, where the overhead of kernel launch and data transfer is amortized across many concurrent streams. The integrity‑checking mode saw a 30% reduction in average latency and a 2–3× increase in overall I/O throughput. Importantly, the authors measured the impact on co‑running applications (a web server and a transactional database). Even when the GPU was shared, the average response time of these applications rose by less than 5%, demonstrating that the off‑loading does not monopolize system resources. To guarantee graceful degradation under heavy load, a dynamic scheduler monitors GPU utilization; if a configurable threshold is exceeded, pending hash jobs are redirected to the CPU (a fallback path), preserving quality‑of‑service for latency‑sensitive services.
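The threshold‑based fallback described above can be sketched as a simple dispatch decision. `gpu_hash` and `cpu_hash` below are hypothetical stand‑ins for the two execution paths (both reduced to `hashlib` here so the sketch is self‑contained); the 0.8 threshold is an illustrative assumption, the paper only says the threshold is configurable.

```python
import hashlib


def gpu_hash(block: bytes) -> bytes:
    # Placeholder for the GPU off-load path (hypothetical).
    return hashlib.sha1(block).digest()


def cpu_hash(block: bytes) -> bytes:
    # CPU fallback path; must produce identical digests.
    return hashlib.sha1(block).digest()


def dispatch(jobs: list[bytes], gpu_utilization: float,
             threshold: float = 0.8) -> list[bytes]:
    """Send hash jobs to the GPU while measured utilization stays below
    the configurable threshold; otherwise redirect them to the CPU so
    latency-sensitive co-running services keep their quality of service."""
    backend = gpu_hash if gpu_utilization < threshold else cpu_hash
    return [backend(job) for job in jobs]
```

Because both paths compute the same digest, callers are oblivious to where a given batch actually ran, which is what makes the graceful degradation transparent.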
The analysis highlights several key insights. First, the bottleneck in modern storage stacks is shifting from disk/network bandwidth to compute‑bound operations such as hashing, especially in environments that demand strong integrity guarantees or aggressive deduplication. Second, GPU acceleration can be introduced with modest software changes—primarily a CUDA kernel and an asynchronous data‑movement layer—without requiring a redesign of the underlying storage hardware. Third, the study confirms that the GPU’s high concurrency does not necessarily interfere with CPU‑centric workloads, provided that data transfers are overlapped and that a fallback mechanism exists.
Nevertheless, the paper also acknowledges limitations. GPU memory is limited (typically a few gigabytes), constraining the size of metadata that can be kept on‑device; this necessitates careful chunk sizing and streaming. Data transfer latency over PCI‑e can dominate for very small jobs, so the benefits are most pronounced when processing large batches or when the workload naturally aggregates many hash operations. Finally, the reliance on CUDA ties the implementation to NVIDIA hardware and raises development complexity compared with pure‑CPU code.
In conclusion, the authors argue that the dramatic reduction in computation cost—exemplified by GPUs—creates an opportunity to redesign storage systems around compute‑centric primitives. By off‑loading hashing to GPUs, they achieve tangible performance gains while preserving the performance of co‑located services. Future work is suggested in three directions: (1) extending the approach to other compute‑intensive primitives such as erasure coding or encryption, (2) exploring alternative accelerators (FPGA, ASIC) and comparing their energy‑efficiency and latency characteristics, and (3) scaling the design to multi‑tenant cloud environments where GPU resources are shared among many tenants, requiring sophisticated scheduling and isolation mechanisms.