FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the skip connections in the model, and it is unclear a priori whether the modified architecture can remain as capable, especially for large state-of-the-art models and when modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models ranging from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve accuracy within 1% of its instruction-tuned release averaged across a wide range of downstream evaluations. In addition to demonstrating the retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.
💡 Research Summary
The paper tackles a fundamental bottleneck in large‑scale Mixture‑of‑Experts (MoE) models: blocking communication during distributed training and inference. In typical MoE pipelines, after the router decides which experts to activate, tokens are dispatched to the appropriate expert ranks, processed, and then combined via all‑to‑all communication. Because these communication steps are synchronous, the next layer cannot start until the data has been fully exchanged, leaving accelerators idle and limiting scalability, especially as model sizes and sparsity increase.
To eliminate this idle time, the authors introduce FarSkip‑Collective, a systematic modification of the residual connections that underlie transformer‑style models. Instead of waiting for the exact output of layer k (which may be pending due to a communication collective), the next layer k+1 proceeds using an “available activation” that is either (a) the previous layer’s output (outdated) or (b) a partially computed activation that excludes the portion depending on the pending communication (partial). Concretely, for attention sub‑blocks the partial activation includes the shared‑expert MLP output but omits the routed‑expert contribution; for MoE sub‑blocks the outdated activation simply reuses the full activation from the prior layer. This redesign creates a temporal window in which the communication of layer k can be overlapped with the computation of layer k+1.
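The activation bookkeeping described above can be illustrated with a toy residual stream. The sketch below is purely illustrative: the `shared_mlp`, `routed_experts`, and scaling constants are hypothetical stand-ins, not the paper's actual sub-blocks. The key property it demonstrates is that the routed-expert term is deferred, not dropped: once the collective completes, adding it back reproduces the full residual sum.

```python
def shared_mlp(x):       # stand-in for the always-active shared-expert MLP (hypothetical)
    return [0.1 * v for v in x]

def routed_experts(x):   # stand-in for routed experts whose output arrives
    return [0.2 * v for v in x]   # late, via the all-to-all collective (hypothetical)

def add(a, b):
    return [u + v for u, v in zip(a, b)]

x_k = [1.0, -2.0, 3.0]   # residual-stream activation entering layer k

# Standard residual stream: layer k+1 consumes the full output of layer k,
# so it must block until the routed-expert all-to-all finishes.
full_k = add(add(x_k, shared_mlp(x_k)), routed_experts(x_k))

# FarSkip "partial" activation for the next attention sub-block:
# the shared-expert output is included, the pending routed-expert term is not.
partial = add(x_k, shared_mlp(x_k))

# FarSkip "outdated" activation for the next MoE sub-block:
# simply reuse the prior layer's activation.
outdated = x_k

# Once the collective completes, the routed-expert contribution is added
# back into the residual stream: the term is deferred, never discarded.
caught_up = add(partial, routed_experts(x_k))
```

Because the residual stream is a running sum, reordering when the routed-expert term is added changes only *when* downstream sub-blocks see it, which is exactly the window FarSkip-Collective exploits for overlap.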
Because removing the most recent expert output from the input could degrade model expressivity, the authors pair the architectural change with a lightweight self‑distillation procedure called FarSkip‑Collective Self‑Distillation (FCSD). FCSD treats the original MoE checkpoint as a teacher and fine‑tunes the modified model using a combination of KL‑divergence on logits and L2 alignment on intermediate hidden states. The distillation uses less than 10 B tokens of high‑quality data, making it inexpensive compared to full pre‑training.
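A minimal sketch of such a distillation objective is below. The function name `fcsd_loss` and the `alpha`/`beta` weighting are hypothetical illustrations of "KL on logits plus L2 on hidden states"; the paper's exact loss weighting and token-level reduction are not specified here.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fcsd_loss(student_logits, teacher_logits,
              student_hidden, teacher_hidden,
              alpha=1.0, beta=1.0):
    """Hypothetical FCSD-style objective: KL(teacher || student) on the
    output logits plus a mean-squared (L2) penalty aligning intermediate
    hidden states. alpha/beta are illustrative weights, not from the paper."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_t, p_s) if t > 0.0)
    l2 = sum((a - b) ** 2
             for a, b in zip(student_hidden, teacher_hidden)) / len(student_hidden)
    return alpha * kl + beta * l2
```

When the modified student exactly matches the frozen teacher, both terms vanish, so the loss directly measures how much capability the architectural change removed.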
The methodology is evaluated on three state‑of‑the‑art open‑source MoE models spanning 16 B to 109 B parameters: DeepSeek‑V2 Lite (16 B), Qwen‑3‑30B MoE (30 B), and Llama‑4 Scout (109 B). After conversion and FCSD, the models retain accuracy within 2.5 % of their original releases across eleven benchmark datasets, with the 109 B model showing an average drop of less than 1 % relative to the instruction‑tuned baseline. This demonstrates that the “far‑skip” connectivity does not materially harm performance even at frontier scale.
Beyond accuracy, the paper delivers concrete system‑level implementations that achieve high communication‑computation overlap. For training, the authors extend Megatron‑LM with asynchronous all‑to‑all collectives and a PyTorch‑API‑level scheduler, reaching 88.4 % overlap of the expert‑parallelism communication (87.6 % forward, 89.0 % backward). For inference, they integrate the technique into vLLM and SGLang, leveraging HIP/CUDA‑graph asynchronous kernels to obtain up to 97.6 % overlap. In practice, the modified Llama‑4 Scout model enjoys an 18.5 % reduction in Time‑to‑First‑Token (TTFT) and comparable throughput gains across multiple NVIDIA and AMD GPU configurations.
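The scheduling idea behind these implementations can be sketched with a thread-based toy. Real implementations issue asynchronous NCCL/RCCL collectives on separate streams; here a `ThreadPoolExecutor` and `time.sleep` stand in for the collective and for kernel time (all names and timings are illustrative, not the paper's code).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_all_to_all(tokens):
    time.sleep(0.06)            # stands in for expert-parallel network latency
    return tokens               # echoes back the "combined" expert outputs

def next_subblock_compute(x):
    time.sleep(0.04)            # stands in for layer k+1 compute time
    return [2 * v for v in x]

tokens = [1, 2, 3]

# Blocking schedule: layer k+1 waits for the collective to complete,
# so total time is roughly comm + compute.
t0 = time.perf_counter()
combined_blocking = simulated_all_to_all(tokens)
out_blocking = next_subblock_compute(tokens)
blocking_time = time.perf_counter() - t0

# FarSkip-style schedule: issue the collective asynchronously, run
# layer k+1 on the already-available activation in the meantime, then
# collect the expert output; total time is roughly max(comm, compute).
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(simulated_all_to_all, tokens)
    out_overlap = next_subblock_compute(tokens)   # overlaps the transfer
    combined_overlap = pending.result()
overlap_time = time.perf_counter() - t0
```

Both schedules produce identical results; only the wall-clock cost differs, which is why the reported gains show up as overlap percentages and TTFT reductions rather than accuracy changes.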
The contributions can be summarized as follows:
- Algorithmic Innovation – FarSkip‑Collective redefines residual pathways to eliminate blocking communication in MoE layers while preserving the original parameter layout.
- Distillation Pipeline – FCSD efficiently restores any lost capability using a modest amount of data, making the approach practical for any pre‑existing checkpoint.
- System‑Level Optimizations – Implementations for both training (Megatron‑LM) and inference (vLLM, SGLang) achieve >88 % overlap, translating into measurable speed‑ups without low‑level kernel hacks.
- Empirical Validation at Scale – Experiments on models up to 109 B parameters confirm that accuracy loss is ≤1 % on average, and that the method scales to the largest publicly available MoE LLMs.
The work distinguishes itself from prior “out‑of‑date activation” or “partial activation” studies, which were limited to dense models and tensor‑parallelism, by addressing the full expert‑parallelism stack of sparse MoEs and demonstrating viability across all layers. Limitations include potential diminishing returns when communication latency is shorter than the compute of a single sub‑block, and the need for more aggressive multi‑block skipping strategies for extreme sparsity scenarios. Future directions suggested include exploring deeper “far‑skip” across multiple blocks, topology‑aware routing, and extending the technique to other forms of model parallelism such as pipeline parallelism.
Overall, FarSkip‑Collective provides a compelling, hardware‑agnostic pathway to unlock the next level of efficiency for massive MoE models, making large‑scale sparse training and inference more practical for a broader range of research and production environments.