On Throughput-Delay Optimal Access to Storage Clouds via Load Adaptive Coding and Chunking

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Recent literature, including our past work, provides analysis and solutions for using (i) erasure coding, (ii) parallelism, or (iii) variable slicing/chunking (i.e., dividing an object of a specific size into a variable number of smaller chunks) to speed up the I/O performance of storage clouds. However, a comprehensive approach that considers all three dimensions together to achieve the best throughput-delay trade-off curve has been lacking. This paper presents the first set of solutions that can pick the best combination of coding rate and object chunking/slicing options as the load dynamically changes. Our specific contributions are as follows: (1) We establish via measurement that combining variable coding rate and chunking is mostly feasible over a popular public cloud. (2) We relate the delay-optimal values for chunking level and code rate to the queue backlogs via an approximate queueing analysis. (3) Based on this analysis, we propose TOFEC, which adapts the chunking level and coding rate against the queue backlogs. Our trace-driven simulation results show that TOFEC’s adaptation mechanism converges to an appropriate code that provides the optimal throughput-delay trade-off without reducing system capacity. Compared to a non-adaptive strategy optimized for throughput, TOFEC delivers $2.5\times$ lower latency under light workloads; compared to a non-adaptive strategy optimized for latency, TOFEC can scale to support over $3\times$ as many requests. (4) We propose a simpler greedy solution that performs on a par with TOFEC in average delay, but exhibits significantly larger performance variations.


💡 Research Summary

Cloud storage services are now a fundamental building block for many modern applications, yet they continue to suffer from high latency variability and reduced throughput under heavy load. Prior work has tackled these problems separately: (i) using erasure coding to issue multiple redundant parallel requests, (ii) dynamically adjusting job sizes (batching) to improve throughput, or (iii) employing a fixed chunking strategy to reduce per‑request overhead. However, no study has jointly considered coding rate, chunk size, and parallelism to achieve the best possible throughput‑delay trade‑off.

This paper fills that gap by (1) measuring the feasibility of combining variable coding rates and chunking on a popular public cloud (Amazon S3), (2) developing an approximate queueing model that links the optimal coding parameters to the current queue backlog, (3) proposing TOFEC (Throughput‑Optimal FEC Cloud), an online algorithm that adapts both the redundancy factor n/k and the chunking level k (the number of chunks each object is split into) based on real‑time backlog observations, and (4) presenting a simpler greedy heuristic for comparison.

The system model consists of a front‑end proxy that receives user read/write requests, translates each request into n storage tasks, and maintains two FIFO queues: a request queue and a multi‑server task queue serviced by L parallel connections to the cloud. A request completes when any k of its n tasks finish. By treating each task’s service time as a random variable and assuming Poisson arrivals, the authors derive a convex relationship between expected service delay and the backlog length. This relationship yields closed‑form expressions for the delay‑optimal (n, k) pair as a function of the current queue length.
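The k-of-n completion rule above can be illustrated with a short Monte Carlo sketch: a request's latency is the k-th smallest of n i.i.d. task delays, so adding redundant tasks (larger n for fixed k) trims the tail. The per-task delay model below (a fixed base plus an exponential tail) is an assumption chosen for illustration, not the paper's measured S3 traces.

```python
import random
import statistics

def request_latency(n, k, sample_delay):
    """Latency of one (n, k) request: it completes once the
    k fastest of its n parallel tasks have finished."""
    delays = sorted(sample_delay() for _ in range(n))
    return delays[k - 1]

def sample_delay():
    # Hypothetical per-task delay: 50 ms base plus an exponential
    # tail with mean 100 ms (illustrative numbers only).
    return 0.05 + random.expovariate(1 / 0.1)

random.seed(42)
for n, k in [(4, 4), (6, 4), (8, 4)]:
    lat = [request_latency(n, k, sample_delay) for _ in range(10000)]
    print(f"n={n}, k={k}: mean latency {statistics.mean(lat):.3f}s")
```

With k fixed at 4, the simulated mean latency drops as n grows, mirroring the low-backlog regime in which extra redundancy pays off.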

TOFEC monitors the request and task queues, consults a pre‑learned distribution of per‑task delays (obtained from trace data), and selects the (n, k) that minimizes the estimated total delay while keeping the system within its capacity region. When the backlog is small, TOFEC chooses a high redundancy (large n) and small chunks to exploit parallelism and achieve low latency. As the backlog grows, it reduces redundancy and increases chunk size, thereby lowering per‑request overhead and preserving throughput. The algorithm operates in O(1) time per decision and quickly converges to the appropriate operating point after workload changes.

A greedy alternative discards the statistical model and simply picks the (n, k) that appears best given the current number of pending tasks. While it matches TOFEC’s average latency in trace‑driven simulations, its latency variance is substantially higher, making it unsuitable for latency‑sensitive services.

The paper also discusses two practical ways to implement variable chunk sizes on cloud storage: (a) Unique‑Key, where each chunk is stored as a separate object, and (b) Shared‑Key, where all coded strips are stored in a single object and partial‑read/partial‑write APIs are used to retrieve or upload specific byte ranges. Measurements show that parallel reads and writes under the Shared‑Key approach generally exhibit weak correlation, enabling effective parallelism without the storage overhead of Unique‑Key, although in some cloud regions this weak correlation does not hold.
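The Shared-Key layout can be sketched as parallel byte-range reads over a single object. In the snippet below an in-memory blob stands in for the cloud object; against a real store such as Amazon S3, each read would instead be a GET carrying a `Range: bytes=start-end` header. The strip count and sizes are hypothetical.

```python
import concurrent.futures

STRIP_SIZE = 4096                    # assumed strip size in bytes
obj = bytes(range(256)) * 16 * 4     # one shared object holding 4 strips

def ranged_read(blob, index, strip_size=STRIP_SIZE):
    """Fetch strip `index` via a byte-range read of the shared object."""
    start = index * strip_size
    return blob[start:start + strip_size]

# Issue the reads in parallel, as the proxy's task queue would over
# its pool of connections to the cloud.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    strips = list(pool.map(lambda i: ranged_read(obj, i), range(4)))

assert b"".join(strips) == obj  # the strips reassemble the shared object
```

Because all strips share one key, no extra objects (and no extra key-listing or naming machinery) are needed as the chunking level changes; only the requested byte ranges move.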

Extensive trace‑driven simulations using real S3 latency traces demonstrate that TOFEC achieves up to 2.5× lower average latency under light workloads compared with a static throughput‑optimized configuration, and can sustain more than 3× the request rate of a static latency‑optimized configuration under heavy load. Importantly, TOFEC does not sacrifice the system’s maximum capacity; it tracks the lower envelope of the throughput‑delay curves for all coding configurations. The greedy heuristic, while competitive in mean latency, suffers from large tail‑latency spikes.

In summary, the authors provide the first comprehensive framework that jointly optimizes erasure coding rate, chunk size, and parallelism in a cloud storage proxy. By grounding the adaptation logic in a queuing‑theoretic analysis and validating it with real‑world measurements, the work offers a practical, deployable solution that can be integrated into existing proxy layers or directly by cloud providers to improve QoS guarantees without requiring changes to the underlying storage service.

