Trivance: Latency-Optimal AllReduce by Shortcutting Multiport Networks
AllReduce is a fundamental collective operation in distributed computing and a key performance bottleneck for large-scale training and inference. Its completion time is determined by the number of communication steps, which dominates latency-sensitive workloads, and by the communication distance, which affects both latency- and bandwidth-bound regimes. Direct-connect topologies, such as the torus networks used in Google’s TPUv4, are particularly prone to large communication distances due to limited bisection bandwidth. Latency-optimal algorithms, such as Bruck’s, complete AllReduce in $\log_3 n$ steps on a bidirectional ring, but incur large communication distances that result in substantial congestion. In contrast, recent approaches such as Swing reduce communication distance and congestion, but inherently require $\log_2 n$ steps to complete AllReduce, sacrificing latency-optimality. In this paper, we present Trivance, a novel AllReduce algorithm that completes within $\log_3 n$ steps while reducing congestion by a factor of three compared to Bruck’s algorithm and preserving bandwidth-optimality. Trivance exploits both transmission ports of a bidirectional ring within each step to triple the communication distance along both directions simultaneously. Furthermore, by performing joint reductions, Trivance improves both the number of steps and network congestion. We further show that Trivance extends naturally to multidimensional torus networks, retaining its latency advantage while achieving performance comparable to bandwidth-optimal algorithms for large messages. Our empirical evaluation shows that Trivance improves on state-of-the-art approaches by 5–30% for message sizes up to 8 MiB, up to 32 MiB in high-bandwidth settings, and up to 128 MiB on 3D tori. Throughout the evaluation, Trivance remains the best-performing latency-optimal algorithm.
💡 Research Summary
AllReduce is a cornerstone collective operation in distributed training and high‑performance computing, yet it often dominates execution time, especially on large‑scale systems that employ direct‑connect topologies such as the 3‑D torus used in Google’s TPUv4 pods. The completion time of an AllReduce consists of two main components: the number of communication steps (which determines the latency overhead) and the total amount of data traversing the network (which determines bandwidth usage and congestion). Existing algorithms make a trade‑off between these two factors.
- Latency‑optimal algorithms such as Bruck’s achieve the theoretical lower bound of ⌈log₃ n⌉ steps on a bidirectional ring, but they route all traffic in a single direction, causing severe link congestion and increasing the effective transmission delay.
- Bandwidth‑optimal algorithms (e.g., Rabenseifner’s Reduce‑Scatter + AllGather) minimize the amount of data sent per node but require at least ⌈log₂ n⌉ steps, sacrificing latency optimality.
- Hybrid approaches like Swing reduce congestion by alternating communication directions, yet they still need ⌈log₂ n⌉ steps and therefore cannot reach the latency lower bound.
The paper introduces Trivance, a novel AllReduce algorithm that simultaneously exploits both ports of a bidirectional ring (or, more generally, the two ports per dimension of a D‑dimensional torus, 2D ports in total). The key ideas are:
- Triple the communication distance per step – in step k each node sends messages to partners at distance 3ᵏ in both the clockwise and counter‑clockwise directions, using the two ports concurrently.
- Joint reductions – the two incoming messages are reduced together with the local data in a single reduction operation, effectively processing 2·3ᵏ blocks per step. This increases the amount of work done per step, allowing the algorithm to finish in the latency‑optimal ⌈log₃ n⌉ steps while keeping the number of hops per block low.
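The tripling schedule described above can be sketched as a toy shared-memory simulation (the function name `trivance_allreduce_sum` is illustrative, not from the paper; the sketch assumes the node count is a power of three so that the three windows reduced in each step are disjoint — the general case needs the paper's chunking, which is not reproduced here):

```python
def trivance_allreduce_sum(values):
    """Toy simulation of the tripling ring schedule.

    Each list index plays the role of a node on the ring; in step k
    every node reduces its own partial sum with the partial sums of
    the nodes at distance 3^k in both directions (both ports at once).
    """
    n = len(values)
    p = 1
    while p < n:
        p *= 3
    assert p == n, "sketch assumes n is a power of 3"

    acc = list(values)   # acc[i]: partial sum currently held by node i
    dist, steps = 1, 0   # dist = 3^k
    while dist < n:
        # joint reduction: combine the two incoming partial sums
        # (from i - dist and i + dist) with the local accumulator
        acc = [acc[(i - dist) % n] + acc[i] + acc[(i + dist) % n]
               for i in range(n)]
        dist *= 3
        steps += 1
    return acc, steps
```

After step k each node's accumulator covers a contiguous window of 3^(k+1) nodes centered on itself, so the loop terminates after ⌈log₃ n⌉ iterations with every node holding the global sum.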
The authors formalize the cost model using the classic Hockney formulation C(m)=α+β·m, extended to account for congestion:
C(m, A) = ∑ₖ (α + β·mₖ·cₖ), where mₖ is the chunk size transmitted in step k and cₖ is the number of chunks that overlap on a link. Trivance reduces cₖ by a factor of three compared with Bruck, because traffic is spread over both directions and the joint reduction prevents multiple independent streams from contending for the same link. Consequently, the congestion factor Θ in the transmission‑delay term β·m·Θ approaches the ideal Θ ≈ 1, whereas Bruck suffers Θ ≈ 3.
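A back-of-the-envelope instantiation of this congestion-aware model (the parameter values and the flat per-step chunk size below are simplifying assumptions for illustration, not the paper's exact derivation):

```python
def steps_needed(n, base):
    """Smallest s with base**s >= n, i.e. ceil(log_base(n)); integer
    arithmetic avoids floating-point log pitfalls."""
    s, p = 0, 1
    while p < n:
        p *= base
        s += 1
    return s

def cost(n_nodes, msg_bytes, alpha, beta, base, congestion):
    """Total time under the congestion-aware Hockney model, assuming
    (for simplicity) the whole message crosses the network each step."""
    steps = steps_needed(n_nodes, base)
    return steps * (alpha + beta * msg_bytes * congestion)

# hypothetical parameters: 27 nodes, 1 MiB message,
# 1 us link latency, 1 ns/byte inverse bandwidth
n, m = 27, 1 << 20
alpha, beta = 1e-6, 1e-9

bruck    = cost(n, m, alpha, beta, base=3, congestion=3)  # Theta ~ 3
trivance = cost(n, m, alpha, beta, base=3, congestion=1)  # Theta ~ 1
```

Under this toy model both algorithms pay the same ⌈log₃ n⌉ latency terms, and Trivance's advantage comes entirely from the threefold smaller congestion factor on the bandwidth term.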
The algorithm naturally extends to multidimensional torus networks. For a D‑dimensional torus, each node has 2D ports (two per dimension); Trivance runs independent copies of the ring algorithm along each dimension, preserving the ⌈log₃ n⌉ step count and achieving bandwidth‑optimality for large messages.
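A minimal sketch of the dimension-wise extension on a 2-D torus (again a shared-memory toy with assumed names, both torus sides taken to be powers of three; each pass applies the tripling ring schedule along one dimension, so the per-dimension step counts add up to ⌈log₃ n⌉ for n total nodes):

```python
def _ring_sum(values):
    # tripling ring schedule along one dimension (power-of-3 size assumed)
    n, acc, dist = len(values), list(values), 1
    while dist < n:
        acc = [acc[(i - dist) % n] + acc[i] + acc[(i + dist) % n]
               for i in range(n)]
        dist *= 3
    return acc

def allreduce_2d_torus(grid):
    """Dimension-wise AllReduce sketch on a 2-D torus: one pass of the
    ring schedule per dimension; after both passes every node holds
    the global sum."""
    rows = [_ring_sum(row) for row in grid]              # reduce along x
    cols = [_ring_sum(list(c)) for c in zip(*rows)]      # reduce along y
    return [list(r) for r in zip(*cols)]                 # back to row-major
```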
Experimental evaluation is performed with the Structural Simulation Toolkit (SST), covering 2‑D and 3‑D torus topologies, a wide range of latency‑to‑bandwidth ratios (α/β), and message sizes from 32 B to 128 MiB. Baselines include Recursive Doubling (both latency‑ and bandwidth‑optimal variants), Swing, Bruck, and the Bucket/Hamiltonian‑Ring schemes. Results show:
- For 2‑D torus and messages up to 8 MiB, Trivance reduces completion time by 5 %–30 % relative to the best prior method.
- In high‑bandwidth settings (small β) the advantage extends to 32 MiB.
- On 3‑D torus, Trivance outperforms all state‑of‑the‑art algorithms even for 128 MiB messages, where congestion becomes a dominant factor.
Overall, Trivance demonstrates that latency optimality (⌈log₃ n⌉ steps) and uniform network utilization are not mutually exclusive. By leveraging both ports of bidirectional links and performing joint reductions, it achieves the theoretical lower bound on steps while dramatically lowering congestion and transmission delay. The work provides a practical, high‑performance AllReduce solution for modern large‑scale AI training clusters and HPC systems that employ torus‑like interconnects.