Performance and Fault Tolerance in the StoreTorrent Parallel Filesystem
With a goal of supporting the timely and cost-effective analysis of Terabyte datasets on commodity components, we present and evaluate StoreTorrent, a simple distributed filesystem with integrated fault tolerance for efficient handling of small data records. Our contributions include an application-OS pipelining technique and metadata structure to increase small write and read performance by a factor of 1-10, and the use of peer-to-peer communication of replica-location indexes to avoid transferring data during parallel analysis even in a degraded state. We evaluated StoreTorrent, PVFS, and Gluster filesystems using 70 storage nodes and 560 parallel clients on an 8-core/node Ethernet cluster with directly attached SATA disks. StoreTorrent performed parallel small writes at an aggregate rate of 1.69 GB/s, and supported reads over the network at 8.47 GB/s. We ported a parallel analysis task and demonstrate that it achieved parallel reads at the full aggregate speed of the storage node local filesystems.
💡 Research Summary
StoreTorrent is a distributed file system designed to support high‑throughput, fault‑tolerant processing of terabyte‑scale data sets that consist primarily of many small records (tens of kilobytes or less). The authors identify a gap in existing parallel file systems such as PVFS, GlusterFS, and HDFS: these systems are optimized for large, contiguous data blocks and suffer severe overhead when handling a large number of tiny files because each write incurs metadata updates, lock contention, and costly system‑call interactions. To close this gap, StoreTorrent introduces two complementary innovations.
First, an application‑OS pipelining technique is embedded in the client library. Instead of issuing a single synchronous write per record, the client batches multiple records in memory and pushes a pipeline of asynchronous I/O requests into the kernel. The kernel keeps the disk and network interfaces continuously busy, amortizing the cost of system calls and reducing per‑record latency. This approach yields a 1‑10× improvement in small‑record write and read throughput compared to the baseline systems.
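The batching-and-handoff idea described above can be sketched in plain Python. This is a toy single-node illustration, not StoreTorrent's client library: `PipelinedWriter`, its batch size, and the queue depth are all hypothetical names and parameters chosen for the sketch. The key point it demonstrates is that the application thread keeps producing records while a background thread issues batched writes, so per-record system-call cost is amortized.

```python
import os
import queue
import tempfile
import threading

class PipelinedWriter:
    """Toy sketch of application-OS pipelining: the caller keeps producing
    records while a background thread issues the actual writes, one system
    call per batch instead of one per record. (Illustrative only; not the
    StoreTorrent client API.)"""

    def __init__(self, path, batch_size=64):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        self.batch_size = batch_size
        self.pending = queue.Queue(maxsize=8)  # bounded queue -> back-pressure
        self.batch = []
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def put(self, record: bytes):
        # Accumulate records in memory; hand off a whole batch at once.
        self.batch.append(record)
        if len(self.batch) >= self.batch_size:
            self.pending.put(b"".join(self.batch))
            self.batch = []

    def close(self):
        if self.batch:
            self.pending.put(b"".join(self.batch))  # flush the partial batch
        self.pending.put(None)                      # sentinel stops the worker
        self.worker.join()
        os.close(self.fd)

    def _drain(self):
        while True:
            buf = self.pending.get()
            if buf is None:
                return
            os.write(self.fd, buf)  # one write syscall per batch

# Demo: 1000 small records of 100 bytes each reach disk as batched writes.
path = os.path.join(tempfile.gettempdir(), "storetorrent_pipeline_demo.dat")
writer = PipelinedWriter(path)
for _ in range(1000):
    writer.put(b"x" * 100)
writer.close()
print(os.path.getsize(path))  # → 100000
```

The bounded queue matters: if the disk cannot keep up, `put` eventually blocks, which is the back-pressure that keeps memory use constant while still overlapping application work with kernel I/O.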
Second, StoreTorrent adopts a peer‑to‑peer replica‑location index rather than a centralized metadata server for replica tracking. A lightweight tracker still exists to bootstrap file‑to‑peer mappings, but each storage node maintains a local index of the records it stores and the peers holding their replicas. When a client needs to read a record, it contacts the tracker for an initial list of candidate peers, then directly queries those peers for the most recent replica location. If a node fails, the remaining replicas can be read without any data movement; only the replica‑location index is exchanged among surviving peers. This design eliminates the need to copy data during a degraded‑read operation, preserving network bandwidth and keeping read latency low even when one replica is unavailable.
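The lookup flow above can be made concrete with a minimal in-memory sketch. All names here (`StorageNode`, `store`, `read`) are hypothetical, and the real system exchanges these indexes over the network rather than sharing Python dictionaries; the sketch only shows the logic: each node keeps a local index of which peers hold each record's replicas, and a degraded read simply skips the failed peer.

```python
class StorageNode:
    """One storage peer: record data plus a local replica-location index
    (hypothetical structure; not the paper's on-disk format)."""
    def __init__(self, name):
        self.name = name
        self.records = {}        # record_id -> data bytes
        self.replica_index = {}  # record_id -> set of peer names with a copy

def store(nodes, record_id, data, holders):
    """Write a record to every holder and record the holder set on each,
    mirroring a replication factor equal to len(holders)."""
    for peer in holders:
        nodes[peer].records[record_id] = data
        nodes[peer].replica_index[record_id] = set(holders)

def read(nodes, candidates, record_id, failed=()):
    """Degraded read: query surviving candidate peers directly. Only the
    replica-location index is consulted; no data is re-replicated."""
    for peer in candidates:
        if peer in failed:
            continue
        node = nodes[peer]
        if record_id in node.records:
            return node.records[record_id]
    raise KeyError(record_id)

# Demo with replication factor two: node "a" fails, "b" still serves the read.
nodes = {name: StorageNode(name) for name in ("a", "b", "c")}
store(nodes, "rec1", b"payload", holders=("a", "b"))
# The tracker's bootstrap role is played here by the candidate list ("a", "b").
print(read(nodes, ("a", "b"), "rec1", failed=("a",)))  # → b'payload'
```

Because every surviving holder already knows the full holder set, a failure only changes which peer answers, not where the data lives.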
The system was evaluated on a 70‑node Ethernet cluster, each node equipped with an 8‑core CPU and directly attached SATA disks, and driven by 560 parallel client processes (8 per node). Three workloads were measured: (1) synthetic small‑record writes (≤64 KB), (2) synthetic small‑record reads, and (3) a real‑world parallel analysis application that reads many small files concurrently. StoreTorrent achieved an aggregate write throughput of 1.69 GB/s (≈24 MB/s per storage node), which is 8‑10 times higher than PVFS (≈0.21 GB/s) and GlusterFS (≈0.18 GB/s) under the same conditions. For reads, the system sustained 8.47 GB/s over the network, essentially matching the raw bandwidth of the local file systems on the storage nodes (≈9 GB/s).
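A quick arithmetic cross-check of the per-node write figure quoted above, assuming decimal units (1 GB = 1000 MB):

```python
# Aggregate write rate divided across the 70 storage nodes.
aggregate_write_gbs = 1.69               # GB/s across the cluster
nodes = 70
per_node_mbs = aggregate_write_gbs * 1000 / nodes
print(round(per_node_mbs, 1))            # → 24.1 (MB/s per node, i.e. ≈24 MB/s)
```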
In a degraded scenario where only a single replica of each record remained, StoreTorrent’s read throughput dropped by less than 3 %, demonstrating that the peer‑to‑peer index allows immediate access to surviving copies without re‑replicating data. Scaling tests showed near‑linear performance growth as the number of client processes doubled, confirming that the pipelined I/O path does not become a bottleneck at larger scales.
The paper also discusses limitations. The current implementation assumes a replication factor of two; increasing the factor would increase the volume of index synchronization traffic and could affect scalability. The tracker, while lightweight, remains a single point of failure, so future work should incorporate tracker replication or leader election. Additional research directions include dynamic replica placement policies, integration with SSD/NVMe storage for lower latency, and support for heterogeneous workloads that mix small and large objects.
In conclusion, StoreTorrent delivers a cost‑effective solution for workloads dominated by many small records. By combining application‑level I/O pipelining with a decentralized replica‑location service, it achieves order‑of‑magnitude gains in write/read performance and maintains high availability without incurring extra data movement during failures. These results suggest that commodity hardware clusters can be leveraged for high‑performance scientific and engineering analyses that were previously limited to more expensive, specialized storage infrastructures.