Supercharging Packet-level Network Simulation of Large Model Training via Memoization and Fast-Forwarding

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Packet-level discrete-event simulation (PLDES) is a prevalent tool for evaluating the detailed performance of large model training. Although PLDES offers high fidelity and generality, its slow performance has long plagued networking practitioners. Existing optimization techniques either simplify the network model, resulting in large errors, or execute it in parallel across multiple processors, with an upper bound on speedup. This paper explores an alternative optimization direction that reduces the computational load of PLDES while maintaining high fidelity. Our key insight is that, in distributed LLM training, packet-level traffic often exhibits repetitive contention patterns and steady states in which flow rates stabilize; skipping these redundant discrete events speeds up the simulation considerably with negligible error. We realize this idea in Wormhole, a user-transparent PLDES kernel capable of automatic memoization for unsteady states and fast-forwarding for steady states. Wormhole combines network partitioning, state memoization and reuse, and rate-based steady-state identification to accurately determine the period of each flow's steady state while maintaining simulation consistency after fast-forwarding. Experiments demonstrate that Wormhole achieves a 744x speedup over the original ns-3 (510x for MoE workloads), with a bounded error of <1%. Applying current multithreaded parallel techniques together with Wormhole yields a 1012x speedup, reducing the simulation time for one GPT-13B training run on 128 GPUs from 9 hours to 5 minutes.


💡 Research Summary

The paper addresses the long‑standing performance bottleneck of packet‑level discrete‑event simulation (PLDES) when used to evaluate large‑scale language model (LLM) training clusters. While PLDES (e.g., ns‑3‑based simulators) offers the highest fidelity by modeling every packet, queue, loss, and congestion‑control interaction, it suffers from astronomically large event counts (often >10¹²) that make a single training iteration take weeks. Existing acceleration approaches fall into two categories: (1) coarse‑grained models that simplify the network to flow‑level or analytical approximations, incurring 10‑25 % error; and (2) parallel execution on multiple cores or machines, which yields sub‑linear speedups and quickly hits an upper bound (≈10×).

The authors observe two intrinsic properties of LLM training traffic that can be exploited without sacrificing fidelity: (i) Repeated contention patterns – the same set of flows repeatedly contend for the same links during collective operations (All‑reduce, point‑to‑point) across many iterations; (ii) Steady‑state periods – after congestion‑control algorithms converge, flow sending rates become essentially constant, and queue lengths stabilize. By eliminating the simulation of these redundant phases, the computational load can be dramatically reduced.

To this end, they introduce Wormhole, a transparent PLDES kernel built on top of ns‑3. Wormhole operates in three stages. First, a network partitioning algorithm divides the topology into port‑level partitions; each partition contains flows that interact only through a shared set of ports, allowing independent treatment. Second, for unsteady (contention) phases, Wormhole constructs a flow‑contention graph (FCG) as a compact representation of the conflict set and stores the first occurrence of each FCG together with the resulting network state in a memoization database. When the same FCG reappears, the simulator skips the detailed event processing and reuses the stored snapshot, effectively “fast‑forwarding” the unsteady region. Third, Wormhole identifies steady‑state by monitoring the variance of the sending rate over a sliding window; if the variance falls below a configurable threshold, the flow is marked steady. The average rate is then held constant, packet generation at the involved ports is paused (preserving queue occupancy), and timestamps of all pending events are advanced by ΔT, thereby skipping the entire steady interval.
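The memoization stage described above can be illustrated with a short sketch. This is not Wormhole's implementation; it only shows the idea of keying a memoization database on a canonical form of the flow-contention graph (FCG) and reusing a stored state snapshot when the same FCG reappears. The `Snapshot` fields and the edge representation `(flow_id, port_id)` are hypothetical simplifications:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Snapshot:
    """Hypothetical container for the network state captured after an
    unsteady (contention) phase: queue lengths, per-flow rates, and the
    simulated time the phase consumed."""
    queue_lengths: dict
    flow_rates: dict
    elapsed: float

def fcg_key(contending_flows):
    """Canonical key for a flow-contention graph: the sorted list of
    (flow, port) edges, hashed for O(1) lookup. Sorting makes the key
    insensitive to the order in which contentions are discovered."""
    edges = sorted(contending_flows)
    return hashlib.sha256(repr(edges).encode()).hexdigest()

class MemoizationDB:
    def __init__(self):
        self._db = {}

    def lookup(self, contending_flows):
        return self._db.get(fcg_key(contending_flows))

    def store(self, contending_flows, snapshot):
        self._db[fcg_key(contending_flows)] = snapshot

# Usage: on entering an unsteady phase, consult the database first.
db = MemoizationDB()
fcg = [("f1", "p3"), ("f2", "p3")]  # flows f1 and f2 contend at port p3
if db.lookup(fcg) is None:
    # ...simulate the contention phase in full detail, then cache it:
    db.store(fcg, Snapshot({"p3": 12}, {"f1": 50.0, "f2": 50.0}, 1.2e-3))
hit = db.lookup(fcg)  # same FCG reappears -> reuse the stored snapshot
```

On a cache hit, the simulator would apply the snapshot and advance the clock by `elapsed` instead of replaying every packet event in the contention phase.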

The authors provide a theoretical analysis showing that stable sending rates imply stability of all other flow metrics, and they bound the maximum error introduced by the steady‑state threshold to less than 1 % of flow completion time.
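The rate-based criterion can be sketched as a sliding-window variance check. This is a minimal illustration of the idea, not Wormhole's actual detector; the window size and variance threshold are hypothetical parameters standing in for the configurable threshold mentioned above:

```python
from collections import deque
import statistics

class SteadyStateDetector:
    """Marks a flow steady once the variance of its recent sending-rate
    samples falls below a threshold. Window length and threshold are
    illustrative, not values from the paper."""
    def __init__(self, window=4, var_threshold=0.01):
        self.window = window
        self.var_threshold = var_threshold
        self.samples = deque(maxlen=window)

    def observe(self, rate):
        self.samples.append(rate)
        if len(self.samples) < self.window:
            return False  # not enough history yet
        return statistics.pvariance(self.samples) < self.var_threshold

det = SteadyStateDetector(window=4, var_threshold=0.01)
ramp = [10.0, 25.0, 38.0, 40.0]    # rate converging under congestion control
steady = [40.0, 40.1, 39.9, 40.0]  # rate has stabilized
flags = [det.observe(r) for r in ramp + steady]
# flags flips to True only after the window contains stabilized samples
```

Once a flow is flagged steady, the kernel can hold its average rate constant and advance pending-event timestamps by the skipped interval, as described above.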

Experimental evaluation spans several LLM workloads: GPT‑3‑175B, GPT‑13B, MoE‑8×7B, MoE‑32×22B, and a real‑world GPT‑18B trace, across clusters of 128‑1024 GPUs on a fat‑tree topology. Wormhole alone achieves a 744× speedup over vanilla ns‑3 (510× on MoE workloads) while keeping the average error below 0.8 %. When combined with the state‑of‑the‑art multithreaded simulator Unison, the compound speedup reaches 1012× (≈716‑1012× depending on the workload). The memoization hit rate is high because the same contention patterns repeat thousands of times per iteration; steady state occupies >97 % of simulation time in dense models, making fast‑forwarding highly effective.

The paper also discusses limitations: in highly dynamic, multi‑tenant cloud scenarios where traffic patterns are less repetitive and steady periods are rare, the benefits diminish and performance reverts to the baseline ns‑3 level. Even in these worst cases, however, Wormhole incurs no extra computational overhead or accuracy loss.

In summary, Wormhole introduces a novel orthogonal acceleration direction for PLDES by exploiting structural regularities of LLM training traffic—repeated contention and long steady‑state intervals. It delivers multi‑order‑of‑magnitude speedups without compromising the packet‑level fidelity required for accurate network design and optimization of future exascale AI training systems.

