Training LLMs with Fault Tolerant HSDP on 100,000 GPUs


Large-scale training systems typically use synchronous training, requiring all GPUs to be healthy simultaneously. In our experience training on O(100K) GPUs, synchronous training yields low efficiency due to frequent failures and long recovery times. To address this problem, we propose a novel training paradigm, Fault Tolerant Hybrid-Shared Data Parallelism (FT-HSDP). FT-HSDP uses data-parallel replicas as units of fault tolerance. When a failure occurs, only the single data-parallel replica containing the failed GPU or server is taken offline and restarted, while the other replicas continue training. To realize this idea at scale, FT-HSDP incorporates several techniques: 1) We introduce a Fault Tolerant All Reduce (FTAR) protocol for gradient exchange across data-parallel replicas. FTAR relies on the CPU to drive the complex control logic for tasks such as dynamically adding or removing participants, and relies on the GPU to perform data transfers for best performance. 2) We introduce a non-blocking catch-up protocol that allows a recovering replica to rejoin training with minimal stall. Compared with fully synchronous training at O(100K) GPUs, FT-HSDP reduces the stall time due to failure recovery from 10 minutes to 3 minutes, increasing effective training time from 44% to 80%. We further demonstrate that FT-HSDP’s asynchronous recovery does not bring any meaningful degradation to the accuracy of the resulting model.


💡 Research Summary

The paper tackles the scalability bottleneck of synchronous training for large language models (LLMs) on clusters of the order of 100 000 GPUs. In a fully synchronous regime, every GPU must be operational for each training step; when a single GPU or server fails, the entire job stalls while the system re‑initializes communication libraries (e.g., NCCL) and restores a checkpoint. Empirical measurements show that at 100 K GPUs a failure occurs roughly every 18 minutes, and the recovery process can take up to 10 minutes, leaving only 8 minutes of productive training per failure interval (≈44 % efficiency).
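The efficiency figures follow directly from this arithmetic. A quick sketch, assuming the simplified model that one failure arrives per mean-time-between-failures interval and each failure costs a fixed recovery stall:

```python
def training_efficiency(mtbf_min: float, recovery_min: float) -> float:
    """Fraction of wall-clock time spent on productive training,
    assuming one failure per MTBF interval and a fixed recovery stall."""
    productive = max(mtbf_min - recovery_min, 0.0)
    return productive / mtbf_min

# Synchronous training: ~18 min between failures, ~10 min recovery.
sync = training_efficiency(18, 10)   # ≈ 0.44
# FT-HSDP: same failure rate, stall cut to ~3 min.
ft = training_efficiency(18, 3)      # ≈ 0.83
```

This idealized model gives ≈83% for FT-HSDP, slightly above the 80% the paper reports, since recovery stalls are not the only source of lost time.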

To overcome this, the authors introduce Fault‑Tolerant Hybrid‑Shared Data Parallelism (FT‑HSDP). FT‑HSDP builds on the Hybrid‑Shared Data Parallelism (HSDP) paradigm, which already partitions a massive model across multiple parallelism dimensions (data, tensor, pipeline, expert, context) and groups thousands of GPUs into a “replica”. FT‑HSDP treats each replica as an independent fault‑tolerance unit. When a failure occurs, only the replica containing the faulty node is taken offline, rebuilt, and re‑joined, while all other replicas continue training uninterrupted.

Realizing this at scale required two novel system components:

  1. Fault‑Tolerant All‑Reduce (FTAR) – A collective communication protocol that separates control and data paths. The CPU orchestrates complex logic such as dynamic participant addition/removal, failure classification, and congestion control, while the GPUs handle the high‑throughput data transfers. FTAR can be invoked in the failure‑free case with performance comparable to NCCL, but unlike NCCL it can reconfigure the communication topology on‑the‑fly without a full restart.

  2. Non‑blocking Catch‑up Protocol – While a failed replica fetches its latest checkpoint, healthy replicas keep processing new mini‑batches. The recovering replica sends a zero‑gradient for the current step, allowing the all‑reduce to bring all replicas to the same model state. Checkpoint fetching is performed via a peer‑to‑peer, load‑balanced scheme that distributes the I/O load across many GPUs, reducing the fetch time to a few tens of seconds. This overlap eliminates the need to pause the whole job and cuts the stall from 10 minutes to about 3 minutes in a 98 K‑GPU deployment.
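The control/data split in FTAR can be sketched as follows. This is a toy model with illustrative names, not the paper's implementation: the CPU-side control path tracks the live participant set, while the data path (GPU transfers in the real system) reduces gradients over whichever replicas are currently healthy.

```python
class FaultTolerantAllReduce:
    """Toy sketch of FTAR's split design. The real protocol also
    handles failure classification and in-flight reconfiguration."""

    def __init__(self, replica_ids):
        self.live = set(replica_ids)   # control state, owned by the CPU

    def remove(self, replica_id):
        # Control path: drop a failed replica without restarting the job.
        self.live.discard(replica_id)

    def add(self, replica_id):
        # Control path: re-admit a recovered replica.
        self.live.add(replica_id)

    def all_reduce(self, grads):
        # Data path: average gradients across live replicas only.
        vals = [grads[r] for r in self.live if r in grads]
        return sum(vals) / len(vals)

# Four replicas; replica 2 fails and is removed on the fly.
ftar = FaultTolerantAllReduce(range(4))
ftar.remove(2)
avg = ftar.all_reduce({0: 1.0, 1: 2.0, 3: 3.0})  # averaged over 3 survivors
```

The key contrast with a static NCCL communicator is that membership changes here are ordinary control-plane updates rather than a full teardown and re-initialization.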
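The zero-gradient step of the catch-up protocol can be illustrated with a minimal sketch (one scalar parameter per replica, plain SGD, hypothetical function names): because every replica applies the same averaged gradient, and the recovering replica contributes zeros, all replicas end the step with identical weights.

```python
def catch_up_step(params, grads, recovering, lr=0.1):
    """One synchronized step in which the recovering replica
    participates in the all-reduce with a zero gradient, so the
    healthy replicas are never blocked waiting for real gradients."""
    n = len(params)
    contrib = [0.0 if i == recovering else grads[i] for i in range(n)]
    avg = sum(contrib) / n              # all-reduce (average)
    return [p - lr * avg for p in params]

# Replica 2 has just restored a checkpoint matching the others' state.
params = [1.0, 1.0, 1.0, 1.0]
new = catch_up_step(params, [0.4, 0.8, 0.0, 0.4], recovering=2)
```

After the step all four replicas hold the same weights, so the recovering replica is back in lockstep without a global pause.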

The authors also describe the underlying network architecture: a multi‑building RoCE fabric with a three‑level Clos topology inside each data center, AI‑Zone based rack grouping, and a fully connected inter‑DC mesh. By placing each replica within a single AI Zone (or data center) they keep intra‑replica collectives on low‑latency links, while inter‑replica data‑parallel collectives tolerate higher latency.

A comprehensive failure‑analysis pipeline is presented. Rich telemetry from NCCL (wait‑for graphs, per‑rank completion times) is collected every few seconds, enabling a root‑cause tool to pinpoint the exact faulty server or GPU in under five seconds, even at 32 K‑GPU scale. The tool was validated with ~1,500 injected failures, achieving 97.8 % exact identification.
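The core inference behind such root-cause tooling can be sketched simply; this is an illustrative reduction, not the paper's tool. In a synchronous collective, healthy ranks block waiting on the straggler, so the rank whose completion counter lags is the prime suspect:

```python
def find_suspect_ranks(completed_steps):
    """Given each rank's count of completed collectives, flag the
    rank(s) furthest behind: everyone else is stalled waiting on a
    rank that never entered (or never finished) the operation."""
    lowest = min(completed_steps.values())
    return sorted(r for r, c in completed_steps.items() if c == lowest)

telemetry = {0: 1042, 1: 1042, 2: 1041, 3: 1042}  # hypothetical counters
suspects = find_suspect_ranks(telemetry)          # rank 2 is lagging
```

The wait-for graphs described in the paper generalize this idea, disambiguating cases where many ranks lag transitively behind a single root cause.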

Experimental results:

  • On a real 98 K‑GPU training run, FT‑HSDP reduced recovery stall time from 10 min to 3 min, raising effective training time from 44 % to 80 %.
  • In a 256‑GPU testbed, asynchronous recovery did not degrade final model quality; only a modest increase in loss variance was observed, which could be mitigated by applying a square‑root learning‑rate schedule.

The paper’s contributions are: (1) a detailed breakdown of synchronous recovery costs at O(100 K) scale, (2) the FTAR protocol for dynamic, fault‑tolerant collectives, (3) a non‑blocking catch‑up mechanism that minimizes stalls, and (4) empirical evidence that asynchronous recovery does not harm model accuracy.

Overall, FT‑HSDP provides a practical, scalable solution for training next‑generation LLMs on hundreds of thousands of GPUs. By isolating failures at the replica level, leveraging CPU‑driven control for dynamic all‑reduce, and overlapping checkpoint fetching with ongoing training, the approach dramatically improves resource utilization and paves the way for future exascale AI workloads.

