Sparseout: Controlling Sparsity in Deep Networks
Dropout is commonly used to help reduce overfitting in deep neural networks. Sparsity is a potentially important property of neural networks, but is not explicitly controlled by Dropout-based regularization. In this work, we propose Sparseout, a simple and efficient variant of Dropout that can be used to control the sparsity of the activations in a neural network. We theoretically prove that Sparseout is equivalent to an $L_q$ penalty on the features of a generalized linear model and that Dropout is a special case of Sparseout for neural networks. We empirically demonstrate that Sparseout is computationally inexpensive and is able to control the desired level of sparsity in the activations. We evaluated Sparseout on image classification and language modelling tasks to see the effect of sparsity on these tasks. We found that sparsity of the activations is favorable for language modelling performance, while image classification benefits from denser activations. Sparseout provides a way to investigate sparsity in state-of-the-art deep learning models. Source code for Sparseout can be found at https://github.com/najeebkhan/sparseout.
💡 Research Summary
The paper introduces Sparseout, a stochastic regularization technique that extends Dropout by allowing explicit control over the sparsity of neural network activations. While Dropout randomly zeros out units to reduce co‑adaptation, it does not directly influence the distribution of activation magnitudes. Sparseout modifies each activation aₗ,i according to the rule
âₗ,i = aₗ,i + |aₗ,i|^{q/2}·(r_i/p − 1),
where r_i is a Bernoulli(p) mask and q is a hyper‑parameter that determines the L_q norm imposed on the activations. When q < 2 the L_q space is sparse, encouraging many near‑zero activations; when q > 2 the space is dense, pushing activations toward a more uniform magnitude distribution. The authors prove that Sparseout is mathematically equivalent to adding an L_q penalty on the features of a generalized linear model, and they show that for non‑negative activations (e.g., ReLU) the special case q = 2 reduces exactly to standard Dropout.
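The rule above can be sketched in a few lines of NumPy. This is a minimal illustration of the perturbation (not the authors' implementation); it also checks the special case noted above, that for non-negative activations q = 2 coincides with inverted Dropout:

```python
import numpy as np

def sparseout(a, mask, p, q):
    """Sparseout perturbation from the rule above:
    a + |a|^(q/2) * (r/p - 1), where `mask` is a Bernoulli(p) sample
    and q tunes the implied L_q penalty. A sketch, not the paper's code."""
    return a + np.abs(a) ** (q / 2.0) * (mask / p - 1.0)

rng = np.random.default_rng(0)
a = np.abs(rng.normal(size=8))               # non-negative, e.g. ReLU outputs
mask = rng.binomial(1, 0.5, size=8).astype(float)

# With q = 2 and non-negative activations, Sparseout equals inverted Dropout:
# a + a * (mask/p - 1) == a * mask / p
dropout_like = a * mask / 0.5
```

Choosing q < 2 shrinks small activations toward zero more aggressively than large ones (the sparse regime), while q > 2 does the opposite (the dense regime).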
The theoretical analysis proceeds by examining the variance introduced by the perturbation of the design matrix in a generalized linear model. The variance term becomes (1 − p)/p · |X_{ij}|^{q} · β_j², which aggregates into a regularization term proportional to ‖ΓX‖_q^q. This establishes a direct link between the stochastic mask and an implicit L_q norm penalty on the network’s hidden representations.
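The variance expression above can be verified numerically. The following Monte Carlo sketch (with arbitrary illustrative values for p, q, a feature X_ij, and a coefficient β_j) checks that the perturbation |X_ij|^{q/2}·(r/p − 1)·β_j indeed has variance (1 − p)/p · |X_ij|^q · β_j²:

```python
import numpy as np

# Illustrative values; any p in (0, 1), q > 0, x, beta would do.
p, q, x, beta = 0.6, 1.5, -2.0, 0.7

rng = np.random.default_rng(1)
r = rng.binomial(1, p, size=1_000_000)

# Per-sample perturbation of the feature's contribution X_ij * beta_j
samples = np.abs(x) ** (q / 2.0) * (r / p - 1.0) * beta

empirical = samples.var()
analytic = (1 - p) / p * np.abs(x) ** q * beta ** 2
```

Since (r/p − 1) has mean 0 and variance (1 − p)/p, the empirical and analytic values agree up to Monte Carlo error, which is the step that turns the stochastic mask into an explicit L_q-type penalty.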
Empirically, the authors evaluate Sparseout on several benchmarks. First, a simple auto‑encoder on MNIST demonstrates that varying q indeed changes the Hoyer sparsity measure of hidden activations: lower q values (≈1) produce high sparsity (measure ≈0.6), while higher q values (≈3–4) yield much lower sparsity (≈0.2). Second, they compare computational overhead: on a two‑layer auto‑encoder (512 and 1024 hidden units) Sparseout incurs only a marginal increase over Dropout (≈5.8 s per epoch versus 5.3 s), whereas Bridgeout—an earlier stochastic method that also encourages sparsity—is an order of magnitude slower (≈30–60 s per epoch). This confirms that Sparseout can be implemented with minimal changes to existing Dropout pipelines and can leverage highly optimized cuDNN kernels.
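The Hoyer sparsity measure used in the auto-encoder experiment has a standard closed form: it maps a uniform vector to 0 and a one-hot vector to 1. A small sketch of that measure (a standard definition, not code from the paper):

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer's measure: (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1).
    Returns 0 for a perfectly uniform vector, 1 for a one-hot vector."""
    x = np.ravel(x)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

dense = np.ones(100)        # uniform activations -> measure 0.0
sparse = np.zeros(100)
sparse[0] = 1.0             # one-hot activations -> measure 1.0
```

Under this measure, the reported values (≈0.6 for q ≈ 1 versus ≈0.2 for q ≈ 3–4) correspond to substantially more concentrated activation patterns in the low-q regime.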
For image classification, the authors integrate Sparseout into a Wide Residual Network (WRN‑28‑10) and train on CIFAR‑10 and CIFAR‑100. Results show that q > 2 (specifically q = 2.5) consistently outperforms standard Dropout: test error drops from 4.59 % to 3.63 % on CIFAR‑10 and from 21.66 % to 19.07 % on CIFAR‑100. Conversely, q < 2 leads to early over‑fitting and higher error rates, indicating that dense activations are beneficial for visual recognition tasks.
The paper also discusses language modeling with LSTM networks, where input representations are inherently high‑dimensional and sparse. Although detailed numbers are not provided, the authors argue that encouraging sparsity (q < 2) improves language model performance, aligning with the intuition that sparse representations better capture the discrete nature of word tokens.
Overall, Sparseout offers four key advantages: (1) identical implementation complexity and runtime to Dropout; (2) a single tunable hyper‑parameter q that continuously interpolates between sparse and dense activation regimes; (3) a solid theoretical foundation linking stochastic perturbations to L_q regularization; and (4) applicability across convolutional, fully‑connected, and recurrent architectures without sacrificing GPU efficiency. By providing a practical means to investigate and exploit activation sparsity, Sparseout opens new avenues for tailoring regularization to the specific demands of different deep learning tasks.