ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

Reading time: 5 minutes

📝 Abstract

Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing either communication-parallelization cancellation (CPC) on short sequences or out-of-memory (OOM) failures on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel-strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly adoption of the optimal strategy according to the immediate input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by these cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs using datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.


📄 Content

In recent years, Transformer-based large language models (LLMs) have demonstrated remarkable performance not only in text-related tasks but also in cross-domain applications like genomic tasks, owing to their exceptional parallel computing capability and scalability. Models like GPT (OpenAI 2024) leverage self-attention to capture long-distance semantic relationships, with context windows expanding from 512 to 128K tokens, thereby improving performance in long-context tasks, including document understanding and programming assistance. However, growing sequence lengths increase computational and memory complexity quadratically, O(n²), creating significant challenges for LLM training.
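To make the quadratic growth concrete, here is a minimal back-of-the-envelope sketch (not from the paper; the head count and dtype size are illustrative assumptions) of how the attention score matrix alone scales with sequence length n:

```python
# Hypothetical estimate: one layer's attention score matrix is
# (heads × n × n) entries, so memory grows as O(n²) in sequence length.
def attn_score_bytes(n_tokens: int, n_heads: int = 32, dtype_bytes: int = 2) -> int:
    """Bytes for one layer's attention scores: heads * n * n * dtype size."""
    return n_heads * n_tokens * n_tokens * dtype_bytes

# Doubling the sequence length quadruples the score-matrix memory.
for n in (8_192, 16_384, 32_768):
    print(f"{n:>6} tokens -> {attn_score_bytes(n) / 2**30:.0f} GiB")
```

At 8K tokens this toy estimate is already 4 GiB per layer for the scores alone, which illustrates why 128K+ contexts strain device memory.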

Parallelism strategies partition LLM states and intermediate results across distributed devices, providing additional computational and memory resources for scaled model training. Recent research improves training efficiency by partitioning along the sequence dimension, as in Sequence Parallelism (Li et al. 2023) and Ulysses (Jacobs et al. 2023), which have shown great improvements in training throughput. However, when processing documents that exceed 128K tokens, the extremely long sequences often lead to out-of-memory (OOM) failures during model training. To achieve better memory savings, researchers have established fine-grained, memory-optimized sequence parallelisms such as METP (Liang et al. 2025). METP significantly decreases per-device memory consumption, making it suitable for training extremely long sequences. However, it trades training efficiency for increased communication overhead, thereby negating the performance gained by distributed parallelism. This communication-parallelization cancellation (CPC) becomes more severe when memory-saving parallel strategies are applied to short sequences, which incur more frequent communication during LLM training. In short, efficient strategies perform optimally on short sequences, while memory-saving strategies are suited to long sequences.
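The core idea of sequence-dimension partitioning can be sketched as follows; this is an illustrative shard-splitting example in the spirit of sequence parallelism, not the API of any of the cited frameworks:

```python
import numpy as np

def split_sequence(x: np.ndarray, world_size: int) -> list:
    """Shard activations of shape (seq_len, hidden) along the sequence
    dimension, one shard per device, so each device holds only a
    1/world_size slice of the sequence."""
    return np.array_split(x, world_size, axis=0)

x = np.zeros((1024, 8))            # (seq_len, hidden)
shards = split_sequence(x, 4)
print([s.shape[0] for s in shards])  # each device gets 256 tokens
```

Partitioning the sequence axis shrinks per-device activation memory, but operations that mix tokens (such as attention) then require collective communication across the shards, which is the root of the CPC trade-off described above.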

Existing LLM training frameworks, such as Megatron-LM (Shoeybi et al. 2020; Narayanan et al. 2021; Korthikanti et al. 2023), typically offer multiple parallel strategies as options, including tensor parallelism (TP), sequence parallelism (SP), etc. (Tang et al. 2025; Liang et al. 2023). These frameworks significantly enhance the capabilities of large-model training: by selecting appropriate parallel strategies for the training dataset, they can achieve a balanced trade-off between training efficiency and memory consumption. However, because they implicitly assume training workloads are static across samples, these frameworks mostly employ a fixed parallel strategy for all Transformer layers throughout the entire training process. They therefore fail to adapt to real-world input sequences, which range from queries of a few tokens to documents comprising millions of tokens, and remain unable to resolve OOM failures or CPC issues.

As a promising alternative, HotSPa (Ge et al. 2024) pioneers hot-switching at the mini-batch level for different sequence lengths across parallel strategies, integrating unified graph compilation and communication-aware scheduling. However, its coarse-grained switching mechanism, while supporting sequences up to 32K tokens, exhibits incompatibility with modern memory-efficient parallel strategies. This calls for a training framework that incorporates both novel and conventional parallelism to support extremely long sequences exceeding 300K tokens, adaptively selects parallel strategies for dynamic sequences, and achieves fine-grained switching.

To address these limitations, this paper presents ParaDySe, a Parallel-strategy switching approach for Dynamic Sequences in Transformer models. ParaDySe enables layer-wise adaptive parallel training based on real-time sequence lengths, achieving seamless strategy switching without tensor redistribution or communication synchronization. Specifically, our contributions can be summarized as follows:

• We propose ParaDySe, a novel framework enabling adaptive parallel strategy switching for dynamic sequences, which achieves sequence length support up to 624K tokens while significantly improving training efficiency.

• Through the design of modular function libraries based on tensor layout specifications, ParaDySe eliminates tensor redistribution overhead, thereby enabling seamless strategy switching.

• Through sequence-aware cost models for time and memory, ParaDySe achieves on-the-fly optimal layer-wise strategy selection adapted to dynamic sequences, with a balanced trade-off between training efficiency and memory consumption.
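The cost-model-guided selection described above can be sketched as a greedy search. This is an illustrative toy only: the strategy names, the linear cost functions, and the per-layer memory budget are hypothetical stand-ins, not ParaDySe's actual cost models or heuristic:

```python
# Toy cost models: strategy -> (time_cost(n), mem_cost(n)) as functions
# of sequence length n. Efficient strategies are fast but memory-hungry;
# memory-saving ones (e.g. an METP-like strategy) are slower.
STRATEGIES = {
    "TP":   (lambda n: 1.0 * n, lambda n: 4.0 * n),
    "SP":   (lambda n: 1.2 * n, lambda n: 2.0 * n),
    "METP": (lambda n: 2.0 * n, lambda n: 1.0 * n),
}

def select_layerwise(seq_len: int, n_layers: int, mem_budget: float) -> list:
    """Greedily pick, per layer, the fastest strategy whose memory still
    fits the remaining global budget, falling back to memory-saving
    strategies as the budget tightens."""
    by_speed = sorted(STRATEGIES, key=lambda s: STRATEGIES[s][0](seq_len))
    plan, used = [], 0.0
    for _ in range(n_layers):
        for s in by_speed:  # fastest first
            if used + STRATEGIES[s][1](seq_len) <= mem_budget:
                plan.append(s)
                used += STRATEGIES[s][1](seq_len)
                break
        else:
            raise MemoryError("no strategy fits the remaining budget")
    return plan

print(select_layerwise(1000, 4, 11000.0))
```

With a tight budget the plan naturally mixes strategies: early layers get the fast strategy and later layers fall back to memory-saving ones, mirroring the efficiency-memory trade-off the paper targets.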

The symbolic notations used in this paper are defined in Table 1. The Transformer architecture has two core operations (q ∈ Q), the Multi-Head Attention (MHA) and the Feed-Forward Network (FFN) operation, whose computational processes can be formally expressed as Equations (1-3).
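Equations (1-3) themselves are cut off in this excerpt. As a reference point, the standard Transformer formulations of MHA and FFN (standard notation, which may differ from the paper's Table 1 symbols) are:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\qquad (1)

\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\quad \mathrm{head}_i = \mathrm{Attention}(X W_i^{Q},\, X W_i^{K},\, X W_i^{V})
\qquad (2)

\mathrm{FFN}(X) = \sigma(X W_1 + b_1)\, W_2 + b_2
\qquad (3)
```

The softmax(QKᵀ) term in (1) is the n × n score matrix responsible for the quadratic complexity discussed earlier.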

This content is AI-processed based on ArXiv data.
