Mitigating Shared Storage Congestion Using Control Theory

Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performan

Mitigating Shared Storage Congestion Using Control Theory

Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performance but are often workload specific and require deep expertise, making them difficult to generalize or re-use. In shared HPC environments, resource congestion can lead to unpredictable performance, causing slowdowns and timeouts. To address these challenges, we propose a self-adaptive approach based on Control Theory to dynamically regulate client-side I/O rates. Our approach leverages a small set of runtime system load metrics to reduce congestion and enhance performance stability. We implement a controller in a multi-node cluster and evaluate it on a real testbed under a representative workload. Experimental results demonstrate that our method effectively mitigates I/O congestion, reducing total runtime by up to 20% and lowering tail latency, while maintaining stable performance.


📜 Original Paper Content

🚀 Synchronizing high-quality layout from 1TB storage...