HEAL: Online Incremental Recovery for Leaderless Distributed Systems Across Persistency Models
Ensuring resilience in distributed systems has become an acute concern. In today's environment, it is crucial to develop lightweight mechanisms that recover a distributed system from faults quickly and with only a small impact on live-system throughput. To address this need, this paper proposes a new low-overhead, general recovery scheme for modern non-transactional leaderless distributed systems. We call our scheme HEAL. On a node failure, HEAL performs an optimized online incremental recovery. This paper presents HEAL's algorithms for settings with Linearizable consistency and different memory persistency models. We implement HEAL on a 6-node Intel cluster. Our experiments running TAOBench workloads show that HEAL is very effective. HEAL recovers the cluster in 120 milliseconds on average, while reducing the throughput of the running workload by an average of 8.7%. In contrast, a conventional recovery scheme for leaderless systems needs 360 seconds to recover, reducing the throughput of the system by 16.2%. Finally, compared to an incremental recovery scheme for a state-of-the-art leader-based system, HEAL reduces the average recovery latency by 20.7x and the throughput degradation by 62.4%.
💡 Research Summary
The paper addresses the growing need for fast, low‑overhead fault recovery in modern distributed storage systems, especially those that are leader‑less and non‑transactional. Existing leader‑based solutions such as ZooKeeper rely on a central log and a "reactive" recovery process that incurs high latency and redundant update transmission, and forces the recovering node to stay idle during recovery. Leader‑less systems like Hermes avoid a single point of control but recover by copying the entire key‑value store, leading to recovery times measured in minutes or even hours for large datasets.
HEAL (Healing Efficient Adaptive Layer) is introduced as the first online incremental recovery scheme that works in a leader‑less environment. Its design rests on three core techniques:
- Proactive Recovery – When a node fails, the system continuously tracks the updates that the failed node missed. Instead of waiting for a request, the coordinator immediately pushes only those missed updates to the recovering node.
- Redundancy Elimination – The update batch sent to the recovering node is deduplicated so that only the latest version of each record is transmitted, dramatically reducing network traffic and persistence work.
- Active Participation – The recovering node remains a full participant in the ongoing write protocol (both as coordinator and follower), allowing it to apply new updates while it is still catching up, which speeds up overall recovery.
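The active-participation idea can be sketched in a few lines. This is an illustration, not the paper's actual API: the class and method names are hypothetical, and the timestamp layout (version first, node id as tie-breaker) is an assumption about how Lamport timestamps are compared.

```python
# Hedged sketch: a recovering node that keeps serving live writes while
# recovery updates are pushed to it. Per-key timestamps decide which
# version of a record wins, so stale recovery updates cannot clobber
# newer live writes applied during catch-up.

class RecoveringNode:
    def __init__(self):
        self.store = {}  # key -> (ts, value); ts assumed (version, node_id)

    def apply_if_newer(self, key, ts, value):
        """Single code path for both live writes and pushed recovery
        updates: an update is applied only if its timestamp is newer."""
        cur = self.store.get(key)
        if cur is None or ts > cur[0]:
            self.store[key] = (ts, value)

node = RecoveringNode()
node.apply_if_newer("k", (3, 2), "live")       # live write during catch-up
node.apply_if_newer("k", (2, 0), "recovered")  # stale recovery update: ignored
node.apply_if_newer("j", (1, 1), "recovered")  # genuinely missed update: applied
print(node.store["k"][1], node.store["j"][1])  # live recovered
```

Routing both streams through one timestamp check is what lets the node stay a full protocol participant instead of idling until catch-up completes.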
The authors formalize the interaction of Linearizable consistency with five persistency models (Synchronous, Strict, Read‑Enforced, Eventual, and Scope) within the Distributed Data Persistency (DDP) framework. Each model is expressed through distinct ACK and VAL message flags (e.g., ACK_C, ACK_P, VAL_C, VAL_P) that convey both consistency and durability information. This unified treatment enables HEAL to operate correctly under a wide range of consistency‑persistency trade‑offs.
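As a rough illustration of how such flags could drive the protocol (this is a sketch under assumptions, not the paper's actual message handling), a coordinator might gate write completion differently per model: strict models wait for persistence acknowledgments, while relaxed models complete on consistency acknowledgments alone.

```python
# Hypothetical sketch: ACK_C = "applied in memory" (consistency),
# ACK_P = "made durable" (persistence). The per-model completion rules
# below are illustrative assumptions, not taken from the paper.

ACK_C, ACK_P = "ACK_C", "ACK_P"

def write_complete(acks, persistency_model):
    """acks: flags received from the followers for one write."""
    if persistency_model in ("Synchronous", "Strict"):
        # Durability required before the write can complete.
        return all(a == ACK_P for a in acks)
    # Relaxed models: consistency ACKs suffice; persistence is
    # enforced later (e.g., on reads for Read-Enforced).
    return all(a in (ACK_C, ACK_P) for a in acks)

print(write_complete([ACK_P, ACK_P], "Synchronous"))  # True
print(write_complete([ACK_C, ACK_P], "Synchronous"))  # False
print(write_complete([ACK_C, ACK_C], "Eventual"))     # True
```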
Implementation details: HEAL is implemented on a six‑node Intel Xeon cluster (7 cores per node). The baseline leader‑less protocol follows Hermes, using Lamport logical timestamps (<node‑id, version>) to order writes without a central leader. For recovery, each node maintains an “update buffer” that records the latest version of each key together with its timestamp. When a failure is detected, the buffer of a healthy node is scanned, redundant entries are removed, and the minimal set of missed updates is streamed to the recovering node via RDMA.
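A minimal sketch of the per-node "update buffer" described above: by recording only the latest (timestamp, value) per key on every write, the scan at failure time already yields a near-deduplicated set of missed updates. The class and method names, and the timestamp layout (version first), are assumptions for illustration.

```python
# Hedged sketch of the update buffer. Keeping only the newest version
# per key performs the redundancy elimination incrementally, so the
# failure-time scan streams a minimal batch to the recovering node.

class UpdateBuffer:
    def __init__(self):
        self._latest = {}  # key -> (ts, value)

    def record(self, key, ts, value):
        cur = self._latest.get(key)
        if cur is None or ts > cur[0]:
            self._latest[key] = (ts, value)  # supersede older version

    def missed_updates(self, last_acked):
        """Updates the failed node never acknowledged. last_acked maps
        key -> highest ts the failed node is known to have applied."""
        return {k: v for k, v in self._latest.items()
                if k not in last_acked or v[0] > last_acked[k]}

buf = UpdateBuffer()
buf.record("a", (1, 0), "old")
buf.record("a", (2, 1), "new")   # replaces the older version of "a"
buf.record("b", (1, 2), "b1")
missed = buf.missed_updates({"a": (1, 0)})
print(sorted(missed))  # ['a', 'b'] — the minimal set to stream
```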
Evaluation: The authors run TAOBench workloads that emulate realistic read‑write mixes and vary the database size from 1 GB to 256 GB. Under the <Linearizable, Synchronous> configuration, HEAL achieves an average recovery latency of 120 ms, compared with 360 seconds for Hermes's full‑copy approach. Throughput degradation during recovery is only 8.7% on average, whereas Hermes suffers a 16.2% drop and ZooKeeper's incremental recovery incurs over 23% degradation. Across all five persistency models, HEAL's latency remains below 150 ms, demonstrating that the extra durability guarantees do not significantly impact recovery speed. The authors also show that the deduplication step reduces the amount of data transferred by up to 90% for workloads with high write contention.
The paper further explains why leader‑based recovery techniques cannot be directly transplanted into a leader‑less setting: (a) leader‑less systems must handle concurrent coordinators updating the same record, requiring version‑based conflict resolution; (b) there is no single log to query for missing updates; (c) failure handling must treat every node as both coordinator and follower, eliminating the simple “replace the failed follower” pattern used in leader‑based systems.
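Point (a) hinges on a total order over concurrent writes to the same record. A sketch of such version-based resolution, assuming (version, node-id) timestamp tuples compared lexicographically (the tuple layout is an assumption, not confirmed by the paper):

```python
# Hedged sketch: with no leader, two coordinators may concurrently issue
# the same version number for one key. Breaking version ties by node id
# gives a total order, so every replica converges on the same winner
# regardless of message arrival order.

def winner(ts_a, ts_b):
    # Higher version wins; equal versions are broken deterministically
    # by node id, so replicas never disagree on ties.
    return max(ts_a, ts_b)

# Coordinators 1 and 4 both issue version 7 for the same key:
print(winner((7, 1), (7, 4)))  # (7, 4) on every replica
```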
In conclusion, HEAL demonstrates that online incremental recovery is feasible and highly efficient in leader‑less distributed systems. By proactively pushing a minimal, deduplicated set of missed updates and keeping the recovering node active in the protocol, HEAL reduces recovery latency by more than 20× and cuts throughput impact by over 60% compared with the best existing alternatives. The design is generic enough to be adapted to other leader‑less or hybrid architectures, and the authors suggest future work on scaling to larger clusters, integrating with emerging persistent memory technologies, and extending the DDP framework to additional consistency models.