How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs


The intermediate layers of deep networks can be characterised as a Gaussian process; in particular, the Edge-of-Chaos (EoC) initialisation strategy prescribes the limiting covariance matrix of that Gaussian process. Here we show that the choice of the Gaussian process variance, an under-utilised degree of freedom, is important in the training of deep networks with sparsity-inducing activations, such as the shifted and clipped ReLU, $\text{CReLU}_{τ,m}(x)=\min(\max(x-τ,0),m)$. Specifically, initialisations leading to larger fixed Gaussian process variances allow for improved expressivity with activation sparsity as large as 90% in DNNs and CNNs, and generally improve the stability of the training process. Enabling full, or near-full, accuracy at such high levels of sparsity in the hidden layers suggests a promising mechanism for reducing the energy consumption of machine learning models involving fully connected layers.
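The activation in question is straightforward to implement; a minimal NumPy sketch (the function name and parameter defaults are ours, for illustration only):

```python
import numpy as np

def crelu(x, tau=1.0, m=1.0):
    """Shifted-and-clipped ReLU: CReLU_{tau,m}(x) = min(max(x - tau, 0), m).

    Inputs below tau are mapped to exactly zero (inducing activation
    sparsity); outputs are capped at m.
    """
    return np.minimum(np.maximum(x - tau, 0.0), m)

print(crelu(np.array([-1.0, 0.5, 1.5, 3.0])))  # [0.  0.  0.5 1. ]
```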


💡 Research Summary

The paper investigates how the choice of the fixed‑point variance q^{*} in Edge‑of‑Chaos (EoC) initialization influences the training stability of deep neural networks (DNNs) and convolutional neural networks (CNNs) that employ sparsity‑inducing activation functions such as the shifted‑and‑clipped ReLU, denoted CReLU_{τ,m}(x)=\min(\max(x-τ,0),m). Traditional EoC theory fixes q^{*}=1 and tunes the weight variance σ_w² so that the correlation‑map derivative χ₁ equals one, guaranteeing stable forward propagation of signals and backward propagation of gradients. However, when the activation function forces a large region around the origin to output zero (as with CReLU or its symmetric counterpart CST), the variance map V(q) can develop steep slopes (V′>1) and even a second fixed point, leading to exploding or vanishing hidden‑layer variances q^{(ℓ)} and unstable gradients.
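Under the standard mean-field analysis, the variance map takes the form V(q) = σ_w² E_{z∼N(0,1)}[φ(√q z)²] + σ_b². A minimal Monte Carlo sketch for φ = CReLU (the values of σ_w², σ_b², τ, and m below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(500_000)  # shared N(0,1) samples

def crelu(x, tau=1.0, m=1.0):
    # Shifted-and-clipped ReLU: min(max(x - tau, 0), m)
    return np.minimum(np.maximum(x - tau, 0.0), m)

def V(q, sigma_w2=6.0, sigma_b2=0.0, tau=1.0, m=1.0):
    # Mean-field variance map: V(q) = sigma_w^2 * E[phi(sqrt(q) z)^2] + sigma_b^2
    return sigma_w2 * np.mean(crelu(np.sqrt(q) * z, tau, m) ** 2) + sigma_b2

for q in (0.5, 1.0, 2.0, 4.0):
    print(f"V({q}) ≈ {V(q):.3f}")  # compare V(q) against q to locate fixed points
```

Plotting V(q) against the diagonal in this way makes the steep slopes and extra fixed points described above directly visible.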

The authors propose to deliberately increase the fixed‑point variance q^{*} (e.g., to 2 or 3) by scaling the initial weight variance accordingly. They demonstrate analytically that a larger q^{*} simultaneously reduces the first derivative V′(q^{*}) and the curvature V″(q^{*}) of the variance map. A smaller V′ brings the system back into the stable regime (V′<1), while a reduced curvature restores symmetry around the fixed point, making the χ₁(q) function less sensitive to stochastic fluctuations of q^{(ℓ)}. This, in turn, limits the product ∏_{ℓ}χ₁(q^{(ℓ)}) that governs the growth of the error second moment \tilde v^{(ℓ)} during back‑propagation, preventing exponential error amplification.
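The claim that a larger fixed-point variance flattens the variance map can be sanity-checked numerically. A sketch using a central finite difference on a Monte Carlo estimate of V (sharing the random samples keeps the differences low-noise; σ_w², τ, and m are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(500_000)  # shared samples => low-noise finite differences

def crelu(x, tau=1.0, m=1.0):
    return np.minimum(np.maximum(x - tau, 0.0), m)

def V(q, sigma_w2=6.0, tau=1.0, m=1.0):
    # Mean-field variance map (bias variance set to zero for simplicity)
    return sigma_w2 * np.mean(crelu(np.sqrt(q) * z, tau, m) ** 2)

def V_prime(q, h=1e-3):
    # Central finite-difference estimate of V'(q)
    return (V(q + h) - V(q - h)) / (2 * h)

for q in (1.0, 2.0, 3.0):
    print(f"V'({q}) ≈ {V_prime(q):.3f}")  # slope shrinks as q grows
```

For this thresholded, clipped activation the slope V′(q) shrinks monotonically over q ∈ [1, 3], consistent with the stabilisation argument above.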

Beyond the infinite‑width limit, the paper incorporates the finite‑width corrections described by Roberts & Yaida (2022). The corrected variance \tilde q^{(ℓ)} and the fourth moment r^{(ℓ)} obey recursive relations that involve V′ and V″. Increasing q^{*} shrinks both the leading‑order correction to \tilde q^{(ℓ)} and the fluctuation term r^{(ℓ)}, further stabilizing the actual finite‑width network.

Empirical validation is performed on fully‑connected feed‑forward networks (4–12 layers) and standard CNNs (6–20 layers) trained on MNIST, CIFAR‑10, and Tiny‑ImageNet. The experiments vary the clipping parameter m (1 ≤ m ≤ 2) and fix the sparsity threshold τ at 1, while testing three values of q^{*} (1, 2, 3). Results show that with the conventional q^{*}=1, high sparsity (≥85%) leads to training collapse or severe accuracy loss, especially for larger m. When q^{*} is raised to 2 or 3, the same networks achieve up to 90% activation sparsity with less than a 1% drop in test accuracy. In CNNs, this translates to a roughly 45% reduction in FLOPs and a 30% decrease in simulated energy consumption, because zero activations allow entire rows of weight matrices to be skipped.
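The sparsity levels are easy to reproduce in a toy setting: for Gaussian pre-activations h ∼ N(0, q), the fraction of exact zeros after CReLU_{τ,m} is Φ(τ/√q). A sketch (illustrative, not the paper's training pipeline) that also shows how zero activations let a dense layer skip the matching weight rows:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def crelu(x, tau=1.0, m=1.0):
    return np.minimum(np.maximum(x - tau, 0.0), m)

# Fraction of exact zeros for h ~ N(0, q) is Phi(tau / sqrt(q)); here tau = 1
for q in (1.0, 2.0, 3.0):
    h = np.sqrt(q) * rng.standard_normal(1_000_000)
    predicted = 0.5 * (1.0 + erf(1.0 / sqrt(q) / sqrt(2.0)))
    print(f"q={q}: measured {np.mean(crelu(h) == 0.0):.3f}, predicted {predicted:.3f}")

# With the convention y = a W, zero entries of a let the next layer skip
# the corresponding rows of W entirely:
a = crelu(np.sqrt(3.0) * rng.standard_normal(512))
W = rng.standard_normal((512, 256))
nz = np.flatnonzero(a)
y_full = a @ W
y_sparse = a[nz] @ W[nz, :]  # multiply-adds only for nonzero activations
assert np.allclose(y_full, y_sparse)
```

Note that at a fixed threshold τ the sparsity depends only on the ratio τ/√q, which is why the threshold and the operating variance must be considered jointly.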

The authors discuss practical considerations: a larger q^{*} implies higher initial weight magnitudes, which can be mitigated by batch normalization and appropriate learning‑rate schedules. The approach is currently limited to architectures without skip connections; residual networks would dilute sparsity because the identity path bypasses the zeroed activations. Transformer models are also left for future work.

In conclusion, the paper establishes that controlling the Gaussian‑process variance q^{*} is a powerful, under‑explored knob for stabilizing the training of sparsely activated networks. By deliberately increasing q^{*}, one can preserve high activation sparsity (up to 90%) while maintaining full or near‑full predictive performance, thereby offering a principled route toward energy‑efficient deep learning.

