RFP: A Remote Fetching Paradigm for RDMA-Accelerated Systems

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Remote Direct Memory Access (RDMA) is an efficient way to improve the performance of traditional client-server systems. Currently, there are two main design paradigms for RDMA-accelerated systems. The first allows clients to operate directly on the server’s memory, completely bypassing the server-side CPUs. The second follows the traditional server-reply paradigm, in which the server writes results back to the clients. However, the first approach has to expose the server’s memory and requires an extensive redesign of upper-layer software, which is complex, unsafe, error-prone, and inefficient. The second cannot achieve high input/output operations per second (IOPS), because it relies on out-bound RDMA-write at the server side, which is inefficient. We find that the performance of out-bound RDMA-write and in-bound RDMA-read is asymmetric: the latter is five times faster than the former. Based on this observation, we propose a novel design paradigm named the Remote Fetching Paradigm (RFP). In RFP, the server is still responsible for processing requests from the clients. Counter-intuitively, however, instead of sending results back to the clients through out-bound RDMA-write, the server only writes the results into local memory buffers, and the clients use in-bound RDMA-read to fetch them remotely. Since in-bound RDMA-read achieves much higher IOPS than out-bound RDMA-write, our model delivers higher performance than the traditional models. To demonstrate the effectiveness of RFP, we design and implement an RDMA-accelerated in-memory key-value store following the RFP model. To further improve IOPS, we propose an optimization mechanism that combines status checking and result fetching. Experimental results show that RFP improves IOPS by 160%–310% over state-of-the-art models for in-memory key-value stores.


💡 Research Summary

The paper addresses performance limitations of current RDMA‑accelerated client‑server designs. Two dominant paradigms exist: (1) the server‑bypass model, where clients directly read/write server memory via RDMA, completely bypassing the server CPU; and (2) the traditional server‑reply model, where the server processes requests and sends results back using outbound RDMA‑write. The bypass model requires exposing critical server memory, introduces consistency challenges, and demands extensive redesign of higher‑level software, making it unsafe and impractical for general applications. The reply model, while safe, suffers from low IOPS because outbound RDMA‑write is far less efficient than inbound RDMA‑read.

Through micro‑benchmarks on a cluster equipped with Mellanox ConnectX‑3 (40 Gbps) NICs, the authors discover a pronounced asymmetry: inbound RDMA‑read can achieve up to 11.26 million operations per second (MOPS), whereas outbound RDMA‑write peaks at only 2.11 MOPS—a five‑fold difference. The bottleneck stems from the extra state management, request preparation, and acknowledgment handling required for outbound operations, which limits the NIC’s ability to issue many concurrent requests.
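The reported gap can be sanity-checked with trivial arithmetic (the MOPS figures are the paper's micro-benchmark peaks; the script itself is purely illustrative):

```python
# Peak throughput reported in the paper's micro-benchmarks (MOPS = million ops/sec).
INBOUND_READ_MOPS = 11.26   # in-bound RDMA-read, issued by clients
OUTBOUND_WRITE_MOPS = 2.11  # out-bound RDMA-write, issued by the server

ratio = INBOUND_READ_MOPS / OUTBOUND_WRITE_MOPS
print(f"in-bound read is {ratio:.1f}x faster than out-bound write")  # ~5.3x
```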

Motivated by this observation, the authors propose the Remote Fetching Paradigm (RFP). In RFP, clients still use RDMA‑write to deliver requests to the server. The server processes each request using its CPU and stores the result in a local memory buffer. Crucially, the server does not push the result back; instead, clients actively fetch the result using inbound RDMA‑read. This design offloads the network transmission responsibility from the server to the client, allowing the system to exploit the high‑throughput inbound path of the NIC while avoiding the low‑throughput outbound path.
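As a rough illustration, the RFP control flow can be sketched with plain byte buffers standing in for RDMA-registered memory regions. This is a toy single-threaded model, not the paper's implementation, and all names are illustrative:

```python
# Toy model of RFP: bytearrays stand in for RDMA-registered memory.
# In the real system the client's "write" and "read" are one-sided RDMA verbs
# and the server runs concurrently; here they are plain memory copies so the
# control flow is easy to follow.

EMPTY, REQUEST, RESULT = 0, 1, 2  # status codes stored in byte 0 of the buffer

class Server:
    def __init__(self):
        self.buf = bytearray(64)  # per-client buffer in the server's local memory

    def poll_and_process(self):
        # The server CPU polls its local buffer for an incoming request.
        if self.buf[0] == REQUEST:
            payload = bytes(self.buf[1:]).rstrip(b"\x00")
            result = payload.upper()           # stand-in for real request handling
            self.buf[1:1 + len(result)] = result
            self.buf[0] = RESULT               # result stays in *local* memory

class Client:
    def __init__(self, server):
        self.server = server

    def send_request(self, payload: bytes):
        # Client "RDMA-writes" the request into the server's buffer.
        self.server.buf[1:1 + len(payload)] = payload
        self.server.buf[0] = REQUEST

    def fetch_result(self) -> bytes:
        # Client "RDMA-reads" the buffer until the status flag shows a result;
        # the server never issues an out-bound write.
        while self.server.buf[0] != RESULT:
            self.server.poll_and_process()  # in reality runs concurrently
        return bytes(self.server.buf[1:]).rstrip(b"\x00")

srv = Server()
cli = Client(srv)
cli.send_request(b"get:key42")
print(cli.fetch_result())  # -> b'GET:KEY42'
```

The key point the sketch captures is the direction of the final data movement: the result crosses the network only when the client pulls it with an in-bound read, so the server's NIC never performs the slow out-bound path.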

To validate RFP, the authors implement Jakiro, an in‑memory key‑value store built on the RFP model. They further introduce an optimization that merges result status and payload into a single buffer, enabling a single RDMA‑read to both check completion and retrieve data. This reduces round‑trip overhead and further boosts IOPS.
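The merged status-and-payload idea can be sketched as follows; the header layout, field widths, and names here are assumptions for illustration, not Jakiro's actual wire format:

```python
import struct

# Sketch of a combined status+payload buffer: the server writes a small header
# in front of the value, so a single RDMA-read of the whole buffer both checks
# completion and returns the data -- no separate status-check round trip.
# (Illustrative layout; not the paper's exact format.)

HDR = struct.Struct("<BI")  # 1-byte status flag + 4-byte payload length
DONE = 1

def server_publish(buf: bytearray, value: bytes):
    # Write the payload first, then the header; flipping the status flag last
    # makes the result visible to a polling client only once it is complete.
    buf[HDR.size:HDR.size + len(value)] = value
    buf[:HDR.size] = HDR.pack(DONE, len(value))

def client_fetch(buf: bytearray):
    # One "RDMA-read" grabs the entire buffer; header and payload are parsed
    # locally on the client.
    status, length = HDR.unpack_from(buf)
    if status != DONE:
        return None                      # not ready yet -> read again later
    return bytes(buf[HDR.size:HDR.size + length])

buf = bytearray(64)                      # stands in for a registered memory region
assert client_fetch(buf) is None         # status flag still 0: nothing published
server_publish(buf, b"value-for-key42")
print(client_fetch(buf))                 # -> b'value-for-key42'
```

Without the merged layout, a client would need one read to learn that the result is ready and a second to fetch it; combining them halves the number of in-bound reads per request.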

Experimental evaluation covers uniform, skewed, and dynamic workloads, varying client thread counts and request sizes. Compared with state‑of‑the‑art systems based on the bypass model (e.g., Pilaf, C‑Hint) and the reply model (e.g., FaRM, RDMA‑Memcached), Jakiro achieves 160%–310% higher IOPS and more than 50% lower latency for small items (≈20–32 bytes). The inbound RDMA‑read path scales well with increasing client threads, confirming that the server’s NIC is not a bottleneck under RFP.

The paper concludes that RFP offers a general, safe, and high‑performance alternative for RDMA‑enabled services. While it excels for small‑item workloads, the authors acknowledge that large result payloads could increase client‑side read traffic and buffer management complexity. Future work may explore hybrid schemes that combine RFP with traditional outbound writes for large data, as well as more sophisticated buffer allocation and scheduling mechanisms.

Overall, the work demonstrates that rethinking the direction of RDMA data movement—shifting result delivery from outbound writes to inbound reads—can unlock significant performance gains in modern data‑center environments.

