Sparse Checkpointing for Fast and Reliable MoE Training

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

As large language models scale, training them requires thousands of GPUs over extended durations, making frequent failures an inevitable reality. While checkpointing remains the primary fault-tolerance mechanism, existing methods fall short when applied to Mixture-of-Experts (MoE) models. Because of their substantially larger training state, MoE models exacerbate checkpointing overheads, often causing costly stalls or prolonged recovery that severely degrade training efficiency. We present MoEvement, a distributed, in-memory checkpointing system tailored to MoE models. MoEvement is built on three key ideas: (1) sparse checkpointing, which incrementally snapshots subsets of experts across iterations to reduce overhead; (2) a sparse-to-dense checkpoint conversion mechanism that incrementally reconstructs consistent dense checkpoints from sparse snapshots; and (3) upstream logging of activations and gradients at pipeline-stage boundaries, enabling localized recovery without re-executing unaffected workers. Evaluations across diverse MoE models with up to 64 experts show that MoEvement reduces checkpointing overhead by up to $4\times$ and recovery overhead by up to $31\times$ compared to state-of-the-art approaches, sustaining an effective training time ratio (ETTR) $\ge 0.94$ even under frequent failures (MTBF as low as 10 minutes) and delivering up to $8\times$ overall training speedup, all without compromising synchronous training semantics. Overall, MoEvement offers a robust and scalable fault-tolerance solution for the next generation of sparsely activated models.


💡 Research Summary

The paper addresses a critical bottleneck in training large‑scale Mixture‑of‑Experts (MoE) models: fault‑tolerant checkpointing. MoE architectures dramatically increase the total number of parameters by adding hundreds or thousands of “experts,” yet each training step only activates a small subset of them. Traditional checkpointing schemes, designed for dense models, snapshot the entire model state at once. When applied to MoE, this leads to massive I/O and memory pressure, long stalls, and a severe runtime‑recovery trade‑off, especially in environments where the mean time between failures (MTBF) can be as low as ten minutes.

MoEvement proposes three tightly integrated techniques to break this trade‑off. Sparse checkpointing partitions the expert set into N subsets and snapshots a different subset on each iteration. By prioritizing experts according to activation frequency, the system checkpoints frequently used experts more often while keeping per‑iteration checkpoint data to roughly 1/N of the full expert state. This allows the checkpoint copy and persist phases to be fully overlapped with forward/backward computation, eliminating stalls.
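
The subset-selection logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the round-robin partition, and the staleness score (activation count weighted by iterations since last save) are all our assumptions about how activation-frequency prioritization might work.

```python
class SparseCheckpointer:
    """Sketch of sparse checkpointing: experts are split into N subsets,
    and each iteration snapshots the subset whose experts are both hot
    (frequently activated) and stale (saved longest ago)."""

    def __init__(self, num_experts, num_subsets):
        # Partition experts round-robin into num_subsets groups
        # (hypothetical partition scheme).
        self.subsets = [list(range(i, num_experts, num_subsets))
                        for i in range(num_subsets)]
        self.activation_counts = [0] * num_experts
        self.last_saved = [-1] * num_experts

    def record_activations(self, expert_ids):
        # Called by the router each step to track expert usage.
        for e in expert_ids:
            self.activation_counts[e] += 1

    def subset_to_snapshot(self, iteration):
        # Score each subset: hot experts that were checkpointed longest
        # ago contribute the most, so they are saved more often.
        def staleness(subset):
            return sum(self.activation_counts[e] * (iteration - self.last_saved[e])
                       for e in subset)
        best = max(self.subsets, key=staleness)
        for e in best:
            self.last_saved[e] = iteration
        return best
```

Because only one of the N subsets is copied per iteration, the per-step checkpoint volume stays near 1/N of the expert state, which is what makes full overlap with compute plausible.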

The second component, sparse‑to‑dense conversion, stores full‑precision (FP32) weights for the experts included in the current sparse checkpoint and low‑precision (FP16) weights for the rest. During recovery, the system first loads the FP16 state, then incrementally restores the FP32 experts, recomputing only the necessary iterations until a consistent dense checkpoint is reconstructed. This staged conversion preserves synchronous training semantics and model accuracy while avoiding the global rollback required by conventional approaches.
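
The mixed-precision layout and the overlay step of the recovery can be illustrated with a toy sketch. The checkpoint layout (`fp32`/`fp16` dicts keyed by expert id) and function names are hypothetical; half precision is simulated with the standard library's `struct` `'e'` format, and the replay of missed iterations is elided.

```python
import struct

def _fp16(x):
    # Round-trip through IEEE half precision to simulate an FP16 copy.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def save_sparse(weights, subset):
    """weights: dict expert_id -> list[float] (FP32 master weights).
    Stores FP32 only for the chosen subset, FP16 for the rest."""
    return {
        "fp32": {e: list(weights[e]) for e in subset},
        "fp16": {e: [_fp16(v) for v in weights[e]]
                 for e in weights if e not in subset},
    }

def restore_dense(checkpoints):
    """checkpoints ordered oldest -> newest. Start from the FP16 copies,
    then overlay each expert's FP32 snapshot; replaying the few
    iterations since each subset was saved would make the state exact
    (replay omitted here)."""
    dense = {}
    for ckpt in checkpoints:              # newer FP16 copies win
        for e, w in ckpt["fp16"].items():
            dense[e] = list(w)
    for ckpt in checkpoints:              # newest FP32 snapshot wins
        for e, w in ckpt["fp32"].items():
            dense[e] = list(w)
    return dense
```

The point of the overlay order is that every expert ends up with its most recent full-precision snapshot, so the reconstructed dense checkpoint preserves FP32 master weights rather than the lossy FP16 copies.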

The third innovation, upstream logging, records activations and gradients at pipeline‑stage boundaries. When a failure occurs, only the data‑parallel group that actually lost progress rolls back to its most recent sparse checkpoint (typically just a few iterations old). The rest of the workers continue uninterrupted. By leveraging the logged intermediate tensors, the recovering group can complete the sparse‑to‑dense conversion locally, cutting expected recovery time by up to 31× relative to state‑of‑the‑art methods such as CheckFreq and Gemini.
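
A minimal sketch of such boundary logging, assuming a simple keyed log (the class, method names, and truncation policy are illustrative, not the paper's API):

```python
from collections import defaultdict

class BoundaryLog:
    """Records tensors crossing pipeline-stage boundaries so a failed
    stage can replay iterations locally instead of re-running its
    upstream stages."""

    def __init__(self):
        # (stage, iteration) -> list of tensors sent across the boundary
        self._log = defaultdict(list)

    def record(self, stage, iteration, tensor):
        self._log[(stage, iteration)].append(tensor)

    def replay(self, stage, start_iter, end_iter):
        # Feed the recovering stage its logged inputs for iterations
        # [start_iter, end_iter) so it can recompute in isolation.
        for it in range(start_iter, end_iter):
            yield it, self._log[(stage, it)]

    def truncate(self, up_to_iter):
        # Entries older than the newest consistent checkpoint can be
        # freed to bound memory use.
        for key in [k for k in self._log if k[1] < up_to_iter]:
            del self._log[key]
```

Only the data-parallel group that lost state consumes `replay`; every other worker keeps training, which is what localizes recovery cost.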

Implemented on top of DeepSpeed, MoEvement was evaluated on a variety of MoE models spanning language and vision domains, with expert counts ranging from 8 to 64. Key results include: (1) checkpointing overhead reduced by up to 4× compared with MoC‑System, and comparable to Gemini on dense models; (2) recovery overhead reduced by up to 31× versus CheckFreq and 17× versus Gemini; (3) Effective Training Time Ratio (ETTR) remaining ≥ 0.94 even when the MTBF drops to 10 minutes, translating into substantial cost savings in large‑scale cloud training; and (4) no degradation in final model accuracy, confirming that mixed‑precision state (FP32 master weights, FP16 compute weights) and expert consistency are maintained.
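
To see why frequent failures push ETTR down, a back-of-envelope model helps. This is our simplification, not the paper's exact definition: it assumes each failure costs a fixed recovery time plus, on average, half a checkpoint interval of lost progress, and that each checkpoint stalls training by a fixed amount.

```python
def ettr(mtbf, recovery_cost, ckpt_interval_iters, iter_time, stall_per_ckpt):
    """Rough effective-training-time ratio (all times in seconds).
    mtbf: mean time between failures; recovery_cost: time to restore
    after a failure; the other arguments describe the checkpoint cadence.
    All parameter names are illustrative."""
    # On average, half a checkpoint interval of progress is lost per failure.
    lost_per_failure = recovery_cost + 0.5 * ckpt_interval_iters * iter_time
    failure_overhead = lost_per_failure / mtbf
    ckpt_overhead = stall_per_ckpt / (ckpt_interval_iters * iter_time)
    return max(0.0, 1.0 - failure_overhead - ckpt_overhead)
```

Under this model, checkpointing every iteration with near-zero stall (the sparse-checkpointing regime) keeps ETTR high even at a 10-minute MTBF, whereas widening the checkpoint interval or adding per-checkpoint stalls drags it down.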

The paper also provides a comprehensive comparison table (Table 1) showing that MoEvement uniquely satisfies all desirable properties: frequent checkpointing, low runtime overhead, fast recovery, and correctness guarantees. The authors discuss limitations and future work, including scaling to models with thousands of experts and extending the approach to ultra‑low‑precision training (e.g., FP8). Overall, MoEvement delivers a robust, scalable fault‑tolerance solution tailored to the sparsely activated nature of modern MoE models, enabling faster, more reliable training at the scale required for next‑generation foundation models.

