Parallel Dither and Dropout for Regularising Deep Neural Networks
Effective regularisation during training can mean the difference between success and failure for deep neural networks. Recently, dither has been suggested as an alternative to dropout for regularisation during batch-averaged stochastic gradient descent (SGD). In this article, we show that these methods fail without batch averaging, and we introduce a new, parallel regularisation method that may be used without batch averaging. Our results for parallel-regularised non-batch-SGD are substantially better than what is possible with batch-SGD. Furthermore, our results demonstrate that dither and dropout are complementary.
Research Summary
This paper investigates the limitations of two popular regularisation techniques—dropout and dither—when applied to stochastic gradient descent (SGD) without batch averaging. The authors demonstrate experimentally that both methods rely on the smoothing effect of batch-averaged gradients; in pure, non-batch SGD the injected noise or random unit removal leads to unstable updates and a noticeable drop in test accuracy.

To overcome this dependency, the authors propose a novel "parallel regularisation" framework. The core idea is to replicate each training sample into several independent mini-batches, apply dither (high-frequency additive noise) and dropout (random neuron silencing) separately to each replica, compute the gradients for each replica, and then average those gradients before updating the model parameters. This procedure effectively recreates the stabilising influence of batch averaging while allowing the two regularisers to act concurrently on distinct copies of the data.

Empirical evaluation on standard benchmarks (MNIST, CIFAR-10, and a small text classification set) shows that parallel regularisation consistently outperforms the conventional batch-SGD combination of dropout and dither, achieving 1–2 % higher classification accuracy. Moreover, the joint use of dither and dropout yields a synergistic effect: the combined approach surpasses the performance of either technique applied alone.

Importantly, the method exhibits reduced sensitivity to batch size, maintaining robust learning even with very small batches or in a truly online setting where no batch is formed. The authors also discuss computational overhead, noting that the extra forward-backward passes can be parallelised on modern GPUs/TPUs, making the approach practical for real-time or memory-constrained applications.
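The replicate-and-average procedure described above can be sketched on a toy linear model. This is a minimal illustration, not the paper's implementation: all names and hyperparameters here (`n_replicas`, `dither_std`, `drop_p`, the learning rate, and the use of a linear regression target) are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup: linear model y = w·x with squared loss.
# Hyperparameters below are assumptions, not values from the paper.
n_features, n_replicas = 5, 8
dither_std, drop_p, lr = 0.1, 0.2, 0.05

w = np.zeros(n_features)          # model weights being learned
w_true = rng.normal(size=n_features)  # ground-truth weights for synthetic data

def replica_gradient(x, y, w):
    """Gradient of 0.5*(w·x_r - y)^2 for one independently regularised replica."""
    x_d = x + rng.normal(scale=dither_std, size=x.shape)       # dither: additive noise
    mask = (rng.random(x.shape) >= drop_p) / (1.0 - drop_p)    # dropout with inverted scaling
    x_r = x_d * mask                                           # this replica's noisy view
    err = w @ x_r - y
    return err * x_r

for step in range(2000):
    # Truly online: one fresh sample per update, no batch formed.
    x = rng.normal(size=n_features)
    y = w_true @ x
    # Parallel regularisation: average gradients over independent replicas,
    # recovering the stabilising effect of batch averaging.
    grad = np.mean([replica_gradient(x, y, w) for _ in range(n_replicas)], axis=0)
    w -= lr * grad
```

Because each replica draws its own dither noise and dropout mask, averaging their gradients shrinks the variance of the update by roughly a factor of `n_replicas`, which is what lets the regularised update remain stable without a batch of distinct samples.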
The paper concludes by suggesting future work that integrates parallel regularisation with other normalisation schemes (batch‑norm, layer‑norm) and extends the technique to diverse architectures such as transformers and graph neural networks.