Sparse Training of Neural Networks based on Multilevel Mirror Descent


We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the sparsity these iterations naturally induce by alternating between periods of static and dynamic sparsity-pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure, enabling efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guarantees by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm produces highly sparse yet accurate models on standard benchmarks. We also show that the theoretical number of FLOPs relative to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy.


💡 Research Summary

The paper proposes a novel dynamic sparse‑training algorithm that merges linearized Bregman iterations (equivalently, mirror descent with a non‑smooth mirror map) with a multilevel optimization framework. Traditional sparse training either prunes a dense model after training or introduces sparsity during training via explicit ℓ₁ regularization, hard‑thresholding, or dynamic rewiring schemes such as SET or RigL. While effective, these methods often lack rigorous convergence guarantees and can be computationally expensive because they require gradient computation for all parameters at every step.

The authors start from the Bregman iteration formulation: at each iteration a dual variable v is moved along the negative gradient of the loss, then a proximal step with respect to a convex regularizer J (often the ℓ₁ norm) yields the new parameters θ. When J is ℓ₁, the proximal operator becomes the soft‑shrinkage (soft‑thresholding) function, automatically zeroing out small coefficients. This intrinsic pruning makes Bregman iterations attractive for sparse training, but the naïve implementation still processes the full parameter vector.
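The update described above can be sketched in a few lines of NumPy. This is a minimal toy illustration of a linearized Bregman step with an ℓ₁ regularizer, not the paper's implementation: the loss here is a simple quadratic, and the names (`soft_shrink`, `linbreg_step`) and step sizes are ours.

```python
import numpy as np

def soft_shrink(v, lam):
    # Proximal operator of lam * ||.||_1: entries with |v_i| <= lam
    # are set exactly to zero, the rest are shrunk toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def linbreg_step(v, grad, tau, lam):
    # One linearized Bregman iteration: gradient step on the dual
    # variable v, then soft-thresholding to recover theta.
    v_new = v - tau * grad
    return v_new, soft_shrink(v_new, lam)

# Toy quadratic loss L(theta) = 0.5 * ||theta - t||^2, so grad = theta - t.
t = np.array([2.0, -0.05, 0.8, 0.0])
v = np.zeros_like(t)
theta = np.zeros_like(t)
for _ in range(200):
    v, theta = linbreg_step(v, theta - t, tau=0.5, lam=0.5)
```

Note how coordinates stay exactly zero for as long as the accumulated dual variable remains below the threshold `lam`; this is the intrinsic pruning effect the paper exploits.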

To address this, the algorithm introduces a “freeze” mechanism. Every m iterations the current sparsity pattern is frozen: only the parameters that are non‑zero are updated (both gradient and proximal steps), while the rest are left untouched. In the intervening iterations a coarse‑level model is invoked. A restriction operator R(k) selects a subset of groups (e.g., whole convolutional kernels) corresponding to the active parameters, reducing the dimensionality from d to Dₖ. The corresponding prolongation operator P(k)=R(k)ᵀ maps the coarse solution back to the full space, filling inactive groups with zeros. By restricting whole groups, the method naturally accommodates both element‑wise ℓ₁ and group ℓ₁,₂ regularizers, enabling structured sparsity.
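The group-wise restriction and prolongation can be illustrated with explicit selection matrices. The following NumPy sketch uses hypothetical names and a dense identity-based construction for clarity (a real implementation would use index masks rather than matrices); the group layout is our own toy example, not the paper's.

```python
import numpy as np

def make_restriction(active_groups, group_size, num_groups):
    # R selects the rows of the identity belonging to active groups;
    # the prolongation is simply P = R^T, which zero-fills inactive groups.
    idx = np.concatenate([np.arange(g * group_size, (g + 1) * group_size)
                          for g in active_groups])
    R = np.eye(num_groups * group_size)[idx]
    return R, R.T

# 4 groups of size 3 (e.g., tiny convolutional kernels); groups 0 and 2 active.
theta = np.arange(12.0)
R, P = make_restriction(active_groups=[0, 2], group_size=3, num_groups=4)

coarse = R @ theta   # reduced vector of dimension D_k = 6
full = P @ coarse    # back in the full space, inactive groups set to zero
```

Because restriction acts on whole groups, the same machinery covers both element-wise ℓ₁ (group size 1) and structured group-ℓ₁,₂ sparsity.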

From a theoretical standpoint, the method is cast as a two‑level multigrid‑inspired optimizer. The fine level performs cheap updates on the active subspace, while the coarse level periodically solves a reduced problem that guides global convergence. Leveraging Nash’s MGOPT framework and the recent convergence analysis of Multilevel Bregman Proximal Gradient Descent (ML‑BPGD) by Elshiaty & Petra (2025), the authors prove a sub‑linear convergence rate under a Polyak‑Łojasiewicz (PL) condition. The proof shows that the error introduced by restricting to a subspace can be bounded and that the coarse‑level corrections guarantee descent of the original objective.
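For reference, the PL condition invoked here is usually stated as follows (generic symbols; the paper's notation may differ), with $f$ the training objective, $f^\star$ its minimum, and $\mu > 0$ the PL constant:

```latex
\frac{1}{2}\,\|\nabla f(\theta)\|^2 \;\ge\; \mu\,\bigl(f(\theta) - f^\star\bigr) \qquad \text{for all } \theta .
```

This inequality is weaker than convexity but still forces every stationary point to be a global minimizer, which is what makes convergence-rate statements possible for non-convex training objectives.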

Empirical evaluation is conducted on CIFAR‑10, CIFAR‑100, and a down‑sampled ImageNet benchmark. Compared with the standard linearized Bregman (LinBreg) algorithm, the proposed method reduces the theoretical floating‑point operation (FLOP) count relative to dense SGD training from 38 % to only 6 % while preserving test accuracy within 0.2–0.5 % of the dense baseline. It also outperforms other dynamic sparse methods such as SET, RigL, and AC/DC in terms of the sparsity‑accuracy trade‑off, achieving over 90 % parameter sparsity with competitive performance. The experiments further demonstrate that the freeze phases lead to substantial speed‑ups because gradients are computed only for active weights, and that the periodic coarse‑level updates prevent stagnation and improve final accuracy.

Limitations acknowledged include the focus on image classification tasks, the reliance on the PL condition (which may not hold for all deep networks), and the added implementation complexity of managing restriction/prolongation operators and group‑wise selections. Nonetheless, the work establishes a solid bridge between Bregman‑based sparsity promotion and multilevel optimization, offering a principled pathway to train highly sparse neural networks with provable convergence and markedly lower computational cost. Future directions suggested involve extending the framework to other domains (e.g., NLP, reinforcement learning), automating the choice of restriction schedules, and exploring adaptive mirror maps that could further enhance sparsity patterns.

