DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Shampoo is one of the leading approximate second-order optimizers: a variant of it won the MLCommons AlgoPerf competition, and it has been shown to produce models with fewer activation outliers that are easier to compress. Yet applying Shampoo currently comes at the cost of a significant computational slowdown due to its expensive internal operations. In this paper, we take a significant step toward addressing this shortcoming by proposing DASH (Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques. First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and Chebyshev polynomial approximations as novel, faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo's convergence. On the practical side, our GPU-aware implementation achieves up to 4.83× faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.


💡 Research Summary

Shampoo is a leading approximate second‑order optimizer that captures layer‑wise curvature by maintaining left and right preconditioner matrices (L and R) as exponential moving averages of the gradient's outer products. The optimizer's update rule requires computing matrix inverse roots (A^(−1/2) and A^(−1/4)) of these preconditioners, which is the primary computational bottleneck. Traditional implementations rely on eigenvalue decomposition (EVD) or the Coupled‑Newton (CN) iteration; both are expensive on GPUs: EVD has Θ(n³) complexity and poor parallelism, while CN needs several matrix multiplications per iteration and careful scaling for numerical stability.
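The update rule sketched above can be written in a few lines of NumPy. This is a minimal, single-matrix illustration (the names `inv_root` and `shampoo_step` are ours, and the inverse roots here use the EVD baseline rather than the paper's faster solvers):

```python
import numpy as np

def inv_root(A, p, eps=1e-8):
    """Inverse p-th root A^(-1/p) via eigendecomposition (the EVD baseline)."""
    w, V = np.linalg.eigh(A)
    w = np.maximum(w, eps)  # clamp tiny eigenvalues for numerical stability
    return (V * w ** (-1.0 / p)) @ V.T

def shampoo_step(W, G, L, R, lr=0.1, beta=0.95):
    """One Shampoo update for a 2-D parameter W with gradient G."""
    # EMA of the gradient's outer products (left/right preconditioners).
    L = beta * L + (1 - beta) * (G @ G.T)
    R = beta * R + (1 - beta) * (G.T @ G)
    # Precondition the gradient with an inverse fourth root on each side.
    W = W - lr * inv_root(L, 4) @ G @ inv_root(R, 4)
    return W, L, R
```

The `eigh` call in `inv_root` is exactly the Θ(n³), poorly parallel step that the rest of the paper works to replace.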

The paper introduces DASH (Distributed Accelerated SHampoo), which tackles these issues with two complementary innovations. First, instead of processing each B×B block of a preconditioner sequentially, DASH stacks all blocks into a 3‑D tensor (num_blocks × B × B) and processes them in parallel using batched matrix‑matrix multiplication (BMM). This design aligns perfectly with modern Tensor‑Core hardware, dramatically increasing GPU utilization, reducing memory traffic, and enabling half‑precision (FP16) computation with negligible loss of accuracy. The authors also integrate existing Distributed Shampoo features such as grafting, load‑balancing, and FP16 support for the CN path, achieving up to a 5× speed‑up for the optimizer step.
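The block-stacking idea can be illustrated with NumPy's batched matrix multiply, which plays the role that `torch.bmm` would on a GPU. Shapes and the function name are illustrative, not the paper's implementation:

```python
import numpy as np

def update_blocks_batched(L_blocks, G_blocks, beta=0.95):
    """Batched EMA update for all preconditioner blocks at once.

    L_blocks, G_blocks: arrays of shape (num_blocks, B, B). A single
    batched matmul replaces a Python loop over the blocks, which is
    what lets batched kernels saturate Tensor-Core hardware.
    """
    # G @ G^T for every block in one call (matmul broadcasts over axis 0).
    outer = G_blocks @ np.transpose(G_blocks, (0, 2, 1))
    return beta * L_blocks + (1 - beta) * outer
```

The key point is that one kernel launch over the `(num_blocks, B, B)` tensor replaces `num_blocks` small, launch-bound matrix multiplications.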

Second, DASH replaces the costly EVD with two fast iterative solvers for inverse roots: the Newton‑Denman‑Beavers (NDB) iteration and a Chebyshev polynomial approximation (CBSHV). NDB simultaneously computes the square root and its inverse, and the authors further prune the first iteration by exploiting closed‑form expressions, saving two matrix multiplications. For the fourth‑root required by Shampoo, NDB is simply applied twice. The Chebyshev approach uses Clenshaw’s algorithm to evaluate a polynomial that approximates A⁻¹ᐟᵖ, offering stable FP16 performance. Both methods rely only on matrix multiplications, making them highly parallelizable.
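As a rough illustration of this multiplication-only family of solvers, here is the classic coupled Newton-Schulz iteration, which returns the square root and inverse square root together; the paper's Newton-DB iteration is a refinement of this idea, and details such as the pruned first step are omitted here:

```python
import numpy as np

def sqrt_and_inv_sqrt(A, num_iters=30):
    """Coupled Newton-Schulz iteration returning (A^(1/2), A^(-1/2)).

    Uses only matrix multiplications, so it maps well onto GPU matmuls.
    Convergence requires the scaled matrix's eigenvalues to lie in (0, 2),
    hence the division by the spectral norm of A.
    """
    n = A.shape[0]
    I = np.eye(n)
    c = np.linalg.norm(A, 2)  # spectral norm (= largest eigenvalue for SPD A)
    Y, Z = A / c, I.copy()
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z   # Y -> (A/c)^(1/2), Z -> (A/c)^(-1/2)
    return np.sqrt(c) * Y, Z / np.sqrt(c)
```

The inverse fourth root A^(−1/4) then follows by applying the routine twice, as the summary describes: first obtain S = A^(1/2), then take the inverse square root of S.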

A critical insight of the work is the analysis of matrix scaling for iterative convergence. Distributed Shampoo historically scales each block by its Frobenius norm ‖A‖_F, an upper bound on the spectral norm that can be 10–100× larger than λ_max(A). This over‑scaling forces the iterative methods to take many more steps to satisfy convergence criteria. DASH introduces a half‑precision multi‑Power‑Iteration routine that efficiently estimates the spectral radius, providing an optimal scaling factor that dramatically reduces the number of required iterations for both NDB and CN.
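A plain single-vector power iteration already illustrates the scaling fix: for an SPD matrix, ‖A‖_F can exceed λ_max by up to a factor of √n, while a few matrix-vector products recover λ_max itself. The paper's routine is a batched, half-precision multi-power-iteration; this sketch is ours:

```python
import numpy as np

def power_iteration(A, num_iters=50, seed=0):
    """Estimate the largest eigenvalue (spectral radius) of an SPD matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v  # Rayleigh quotient at the (near-)converged vector
```

Scaling a block by this estimate instead of by ‖A‖_F puts its spectrum near the top of the iterative solvers' convergence region, which is why the iteration counts drop.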

Empirical evaluation on a 953‑million‑parameter language model trained across 8 GPUs demonstrates that DASH achieves up to 4.83× faster optimizer steps compared with the well‑optimized Distributed Shampoo baseline. In terms of model quality, Newton‑DB yields the lowest validation perplexity per iteration, outperforming both CN and EVD. The Chebyshev solver also matches or exceeds EVD accuracy while benefiting from FP16 speed‑ups. Memory consumption is modestly reduced (71–73 GB per GPU versus 76 GB for the baseline) due to the block‑stacking and load‑balancing strategy.

In summary, DASH bridges the gap between the theoretical advantages of second‑order optimization and practical hardware constraints. By batching block preconditioners and employing fast, GPU‑friendly inverse‑root solvers with proper scaling, it makes high‑quality preconditioning viable for large‑scale deep learning workloads and opens the door for second‑order methods on resource‑constrained platforms. The code is publicly released, encouraging further research and adoption.

