Fault-tolerant Reduce and Allreduce operations based on correction

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Implementations of Broadcast based on an information-dissemination algorithm – e.g., gossip or tree-based communication – followed by a correction algorithm have been proposed previously. This work applies a similar idea to Reduce: a correction-like communication phase precedes a tree-based phase, yielding a Reduce algorithm that tolerates a bounded number of failed processes. The semantics of the resulting algorithm are stated and proven. Based on these results, Broadcast and Reduce are combined to provide Allreduce.


💡 Research Summary

The paper addresses the problem of making collective communication primitives—specifically Reduce and Allreduce—robust against process failures in high‑performance computing environments. Traditional tree‑based Reduce algorithms are vulnerable: if a single process fails, the entire subtree rooted at that process is cut off, preventing its descendants from contributing their data. To overcome this limitation, the authors propose a two‑phase approach that inserts an “up‑correction” phase before the usual tree reduction.

In the up‑correction phase, the system assumes that at most f processes may fail (either before the operation starts or during its execution). All processes are partitioned into groups of size f + 1 based on their position in the logical reduction tree; the last group may include the root if it would otherwise be undersized. Within each group, every alive process sends its local input value to all other members and locally reduces the received values using the associative, commutative reduction operator. Failed processes simply do not send messages; the remaining members detect the failure via a timeout and continue with the values they have received. After this phase each process holds a “corrected” local value v that already incorporates the contributions of all members of its group (or, if the group contains a failure, the value may or may not include the failed process’s input—both outcomes are allowed by the semantics).
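The group exchange described above can be sketched in a small simulation. This is not the paper's implementation: processes are simulated as list slots, groups are assumed to be consecutive non-root ranks of size f + 1 (with the root joining an undersized last group, as described above), and a failed process simply contributes nothing, standing in for the timeout-based detection.

```python
from functools import reduce
from operator import add

def up_correction(values, alive, f, op=add):
    """Simulated up-correction: every alive member of a group ends up
    holding the reduction of the alive members' inputs.

    values : list of local inputs, indexed by rank (rank 0 is the root)
    alive  : list of booleans; False marks a failed process
    f      : maximum number of tolerated failures (group size f + 1)
    """
    n = len(values)
    corrected = list(values)
    # Partition non-root ranks 1..n-1 into consecutive groups of f + 1;
    # the root joins the last group only if that group is undersized.
    groups = [list(range(s, min(s + f + 1, n))) for s in range(1, n, f + 1)]
    if groups and len(groups[-1]) < f + 1:
        groups[-1].append(0)
    for group in groups:
        # Failed members send nothing; the others proceed with what arrived.
        contrib = [values[r] for r in group if alive[r]]
        for r in group:
            if alive[r]:
                corrected[r] = reduce(op, contrib)
    return corrected
```

For example, with seven processes, f = 1, and process 1 failed, `up_correction(list(range(7)), [True, False, True, True, True, True, True], 1)` leaves process 2 with its own value 2, while the pairs (3, 4) and (5, 6) each hold their combined sums.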

The second phase is the conventional tree‑based Reduce. Each non‑root process receives reduced values from its children, combines them with its own corrected value v, and forwards the result to its parent. The root receives one value from each child subtree; based on failure information it can decide whether the received value already contains the root’s own contribution, whether it needs to add it, or whether it must ignore the values of the last group if the root is not part of that group. The paper formalizes five semantic properties that guarantee correctness: (1) if the root calls deliver_reduce, all non‑failed processes have called init_reduce; (2) each process calls deliver_reduce at most once; (3) the root’s result includes the inputs of all non‑failed processes; (4) inputs of failed processes may be included or omitted, but no intermediate states are observable; (5) every non‑failed process eventually calls deliver_reduce provided all non‑failed processes called init_reduce.
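The failure-free tree phase can be sketched as a standard binomial-tree reduction over the corrected values; this is an illustrative stand-in, not the paper's algorithm, since it omits the failure handling and the root's bookkeeping described above.

```python
from operator import add

def tree_reduce(corrected, op=add):
    """Binomial-tree reduction of the corrected values toward rank 0.

    In round `step`, rank r (with r = step mod 2*step) sends its partial
    result to its parent r - step, which folds it into its own value.
    """
    n = len(corrected)
    acc = list(corrected)
    step = 1
    while step < n:
        for r in range(step, n, 2 * step):
            acc[r - step] = op(acc[r - step], acc[r])  # child -> parent
        step *= 2
    return acc[0]  # the root's final result
```

With inputs 0..6 and no failures, `tree_reduce(list(range(7)))` returns the full sum 21.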

The authors illustrate the algorithm with a concrete example of seven processes where process 1 fails and f = 1. The six non‑root processes are partitioned into groups of size 2: (1, 2), (3, 4), and (5, 6); process 0 (the root) joins no group because the number of non‑root processes is divisible by f + 1. In the up‑correction phase, (3, 4) and (5, 6) exchange values, while process 2 detects the failure of its partner via timeout and keeps its own value. Afterwards, processes 3 and 4 hold the value 7, processes 5 and 6 hold 11, and processes 0 and 2 retain their original values 0 and 2. The subsequent tree phase propagates these corrected values, allowing process 2 to combine 7, 11, and 2 into 20, which is then sent to the root. The root adds its own value 0 and obtains the correct sum of all surviving processes (0 + 2 + 3 + 4 + 5 + 6 = 20).
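The worked example can be replayed in a few lines, under the summary's assumptions: ranks 0..6, input value equal to the rank, f = 1, process 1 failed, and consecutive-rank groups (1, 2), (3, 4), (5, 6). The tree shape is taken directly from the summary (process 2 receives one value per child subtree).

```python
values = list(range(7))      # input of rank r is r
alive = [True] * 7
alive[1] = False             # process 1 has failed

# Up-correction: alive group members combine the inputs of the alive
# members of their group; a failed member contributes nothing.
corrected = list(values)
for group in [(1, 2), (3, 4), (5, 6)]:
    contrib = sum(values[r] for r in group if alive[r])
    for r in group:
        if alive[r]:
            corrected[r] = contrib

assert corrected[3] == corrected[4] == 7
assert corrected[5] == corrected[6] == 11
assert corrected[2] == 2     # its group partner (process 1) failed

# Tree phase as described in the summary: process 2 receives 7 and 11
# from its child subtrees, adds its own corrected value, and the root
# adds its own input 0.
at_process_2 = 7 + 11 + corrected[2]   # 20
result = values[0] + at_process_2      # root contributes 0
assert result == 20                    # 0 + 2 + 3 + 4 + 5 + 6
```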

Allreduce is built by composing a fault‑tolerant Broadcast (based on prior gossip‑plus‑correction work) with the fault‑tolerant Reduce described above. Consequently, every process receives the final reduced value even in the presence of up to f failures.
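The composition can be expressed abstractly as "Reduce to the root, then Broadcast the result". The sketch below uses trivial stand-ins for both halves (a plain sum over surviving inputs and a copy to every alive process), since the fault-tolerant Reduce and Broadcast themselves are described above and in the cited prior work.

```python
def allreduce_sketch(values, alive):
    """Allreduce as Reduce-then-Broadcast, with stand-in phases.

    Returns the reduced value at every alive rank; failed ranks are
    marked with None.
    """
    total = sum(v for v, a in zip(values, alive) if a)  # Reduce stand-in
    return [total if a else None for a in alive]        # Broadcast stand-in
```

E.g., `allreduce_sketch([0, 1, 2, 3], [True, False, True, True])` delivers the surviving sum 5 to every alive process.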

Performance considerations focus on small‑message, latency‑critical scenarios. The up‑correction phase adds f messages per process (each member of a group of size f + 1 sends to its f peers) and incurs timeout latency when failures occur, but this overhead is modest compared to losing an entire subtree in a plain tree Reduce. For large messages, bandwidth‑optimal algorithms (e.g., pipelined trees) may be preferable, but the proposed method remains correct and competitive when latency dominates.

In summary, the paper introduces a novel “pre‑correction” technique that transforms a failure‑sensitive tree Reduce into a fault‑tolerant operation without requiring global synchronization or probabilistic guarantees. By carefully structuring groups, using deterministic timeouts, and defining clear semantic contracts, the authors provide both a rigorous correctness proof and a practical algorithm that can be directly integrated into MPI‑style collective libraries. Future work may explore dynamic group formation, asynchronous execution, and scaling to larger failure thresholds.

