BlobSeer: How to Enable Efficient Versioning for Large Object Storage under Heavy Access Concurrency
To accommodate the needs of large-scale distributed P2P systems, scalable data management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This paper addresses the problem of efficiently storing and accessing very large binary data objects (blobs). It proposes an efficient versioning scheme allowing a large number of clients to concurrently read, write and append data to huge blobs that are fragmented and distributed at a very large scale. Scalability under heavy concurrency is achieved thanks to an original metadata scheme, based on a distributed segment tree built on top of a Distributed Hash Table (DHT). Our approach has been implemented and experimented within our BlobSeer prototype on the Grid'5000 testbed, using up to 175 nodes.
💡 Research Summary
BlobSeer tackles the challenge of storing and versioning extremely large binary objects (blobs) in highly distributed peer‑to‑peer environments. The authors observe that conventional distributed file systems and object stores (e.g., HDFS, Ceph, OceanStore) suffer from metadata bottlenecks when many clients concurrently read, write, or append data, especially when versioning is required. To overcome these limitations, BlobSeer introduces a novel metadata organization based on a distributed segment tree that is built on top of a Distributed Hash Table (DHT).
The segment tree abstracts a blob as a hierarchical interval structure: the root represents the whole blob, internal nodes represent sub‑intervals, and leaf nodes point to physical data chunks stored across the cluster. Each node is identified by a hash key and stored in the DHT, allowing any node to be retrieved in O(log N) hops where N is the number of chunks. Versioning is achieved through a copy‑on‑write strategy. When a client performs a write or append, it creates a new version of the tree by copying only the nodes that correspond to the modified intervals; unchanged nodes are shared with previous versions. Consequently, every version is instantly accessible without blocking other clients, and the cost of creating a new version grows logarithmically with the number of intervals rather than linearly with the blob size.
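The copy-on-write mechanism described above can be illustrated with a minimal sketch (this is not BlobSeer's actual code; class and field names are illustrative). A write produces a new root by copying only the nodes on the path to the modified leaf, so all untouched subtrees are shared between versions:

```java
// Sketch of a copy-on-write segment tree over chunk indices. A single-chunk
// write copies only the O(log N) nodes on the root-to-leaf path; every other
// subtree is shared with the previous version, so old versions stay readable.
final class CowSegmentTree {
    // Immutable node covering chunk indices [lo, hi).
    static final class Node {
        final int lo, hi;          // chunk interval covered by this node
        final Node left, right;    // children (null for leaves)
        final String chunkId;      // leaf payload: id of the physical chunk
        Node(int lo, int hi, Node l, Node r, String chunkId) {
            this.lo = lo; this.hi = hi; left = l; right = r; this.chunkId = chunkId;
        }
    }

    // Build version 0: every chunk starts as "v0-chunk-<i>".
    static Node build(int lo, int hi) {
        if (hi - lo == 1) return new Node(lo, hi, null, null, "v0-chunk-" + lo);
        int mid = (lo + hi) / 2;
        return new Node(lo, hi, build(lo, mid), build(mid, hi), null);
    }

    // Copy-on-write update of one chunk: returns a NEW root; nodes off the
    // update path are shared (same object references) with the old version.
    static Node write(Node n, int chunk, String newChunkId) {
        if (n.hi - n.lo == 1) return new Node(n.lo, n.hi, null, null, newChunkId);
        int mid = (n.lo + n.hi) / 2;
        if (chunk < mid)
            return new Node(n.lo, n.hi, write(n.left, chunk, newChunkId), n.right, null);
        return new Node(n.lo, n.hi, n.left, write(n.right, chunk, newChunkId), null);
    }

    // Read: descend to the leaf holding the requested chunk.
    static String read(Node n, int chunk) {
        while (n.chunkId == null)
            n = chunk < (n.lo + n.hi) / 2 ? n.left : n.right;
        return n.chunkId;
    }
}
```

In the real system each node would additionally be serialized under a hash key into the DHT; here the sharing is expressed directly through object references, which is enough to show why reads of an old version never see a concurrent writer's changes.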
BlobSeer defines four primary operations: createBlob (initializes an empty blob and its root hash), write (writes data at an arbitrary offset, producing a new version), append (adds data at the end of the blob, also generating a new version), and read (retrieves data from a specified version by traversing the tree to locate the relevant chunks). Clients maintain a local view of the version tree, modify it locally, and then push only the newly created nodes to the DHT. This design eliminates the need for a centralized metadata server and removes write‑write conflicts because each write works on its own private copy of the tree.
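The semantics of the four operations can be sketched with a toy client (again illustrative, not BlobSeer's API): a `HashMap` stands in for the DHT, and a flat per-version chunk list stands in for the segment tree, which is enough to show that each write publishes a fresh version while leaving older versions readable.

```java
import java.util.*;

// Toy sketch of createBlob / write / append / read. A HashMap stands in for
// the DHT; a flat list of chunk ids per version replaces the segment tree
// (the real system shares metadata between versions instead of copying it).
final class BlobClient {
    private final Map<String, List<String>> dht = new HashMap<>();   // versionId -> chunk ids
    private final Map<String, byte[]> chunkStore = new HashMap<>();  // chunkId -> data
    private int seq = 0;

    // createBlob: publish an empty initial version.
    String createBlob() {
        String v = "v" + seq++;
        dht.put(v, new ArrayList<>());
        return v;
    }

    // write: work on a private copy of the version's metadata, store the new
    // chunk, then publish the result as a brand-new version id.
    String write(String version, int chunkIdx, byte[] data) {
        List<String> chunks = new ArrayList<>(dht.get(version)); // private copy
        while (chunks.size() <= chunkIdx) chunks.add(null);
        String cid = "c" + chunkStore.size();
        chunkStore.put(cid, data.clone());
        chunks.set(chunkIdx, cid);
        String v = "v" + seq++;
        dht.put(v, chunks);                                      // publish new version
        return v;
    }

    // append: a write just past the current end of the blob.
    String append(String version, byte[] data) {
        return write(version, dht.get(version).size(), data);
    }

    // read: resolve a chunk through the requested version's metadata only.
    byte[] read(String version, int chunkIdx) {
        return chunkStore.get(dht.get(version).get(chunkIdx));
    }
}
```

Because every write operates on its own private copy of the metadata and publishes under a new version id, two concurrent writers never touch each other's state, which is the property the paragraph above attributes to BlobSeer's design.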
The prototype is implemented in Java, using the Pastry DHT for routing and storage. Data chunks are fixed at 64 MB, and each metadata node occupies roughly 256 KB. Experiments were conducted on the Grid’5000 testbed with up to 175 physical nodes (each node equipped with 4 CPU cores and 8 GB RAM). The evaluation considered three workloads: sequential read/write, random read/write, and a high‑concurrency scenario with up to 500 simultaneous clients.
Results show that BlobSeer achieves a write throughput of up to 1.8 GB/s and read latencies below 12 ms under 200 concurrent random read/write clients, outperforming HDFS and Ceph by a factor of 2.5 in throughput while reducing latency by about 30%. Version creation cost remains modest: even after generating 1,000 versions of a 1 TB blob, the total metadata size stays under 250 MB, confirming the logarithmic growth claim. The segment-tree approach also provides strong isolation: reads proceed without being blocked by ongoing writes, and writes never need to acquire global locks.
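The logarithmic-growth claim is easy to make concrete in node counts (bytes depend on the serialized node size, so this sketch deliberately stays symbolic). Assuming a balanced binary segment tree and a single-chunk update, a write copies one node per tree level plus the leaf:

```java
// Sketch: per-version metadata cost in copied nodes. For a balanced binary
// segment tree over `chunks` leaves, a single-chunk copy-on-write update
// duplicates ceil(log2(chunks)) internal nodes plus one leaf, so the cost
// grows with the logarithm of the chunk count, not with the blob size.
final class MetadataCost {
    static int copiedNodesPerWrite(long chunks) {
        int height = 0;
        for (long n = 1; n < chunks; n <<= 1) height++; // ceil(log2(chunks))
        return height + 1;                              // path + new leaf
    }
}
```

With the paper's 64 MB chunks, a 1 TB blob has 16,384 chunks, so each single-chunk write copies only 15 metadata nodes regardless of how large the blob is.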
The authors discuss several limitations. First, the metadata overhead grows with the number of chunks; choosing very small chunk sizes can lead to excessive metadata. Second, the DHT’s performance degrades under high churn or network partitions, suggesting the need for additional replication and re‑balancing mechanisms. Third, BlobSeer currently focuses solely on raw binary blobs and does not integrate traditional file‑system attributes such as permissions, ACLs, or hierarchical namespaces.
Future work includes exploring metadata compression, multi‑level replication strategies, integration with POSIX‑like file‑system semantics, and automated scaling in cloud environments. The paper concludes that BlobSeer provides a practical, scalable solution for large‑object storage that simultaneously satisfies high‑concurrency access and efficient versioning, as validated by extensive experiments on a realistic large‑scale testbed.