Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study
Deep neural networks have recently achieved state-of-the-art results in many machine learning problems, e.g., speech recognition and object recognition. Hitherto, work on rectified linear units (ReLU) has provided empirical and theoretical evidence that they increase the performance of neural networks compared to the typically used sigmoid activation function. In this paper, we investigate a new manner of improving neural networks by introducing a bunch of copies of the same neuron modeled by the generalized Kumaraswamy distribution. As a result, we propose a novel non-linear activation function, which we refer to as the Kumaraswamy unit, that is closely related to ReLU. In an experimental study with the MNIST image corpus, we evaluate the Kumaraswamy unit applied to a single-layer (shallow) neural network and report a significant drop in test classification error and test cross-entropy compared to the sigmoid unit, ReLU, and Noisy ReLU.
💡 Research Summary
This paper introduces a novel activation function for artificial neural networks, called the Kumaraswamy unit, which is derived from modeling a “bunch” of identical neurons using the generalized Kumaraswamy distribution (KUM‑G). The authors start from the observation that traditional activation functions such as the sigmoid suffer from vanishing gradients, while piece‑wise linear functions like ReLU have become popular due to their ability to propagate gradients more effectively. However, ReLU’s unbounded positive output can lead to saturation of some hidden units and may cause instability in certain training regimes.
To address these issues, the authors propose replicating a single neuron (i.e., using the same weight vector and bias) b times, treating each copy as an independent component. Additionally, each component is assumed to consist of a sub-components, reflecting the internal complexity of the neuron. The activation probability of the entire bunch is then modeled by the generalized Kumaraswamy distribution, which for a base distribution G(x) and shape parameters a, b is defined as
K_G(x|a,b) = 1 – (1 – G(x)^a)^b.
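The KUM‑G construction above can be sketched as a small higher‑order function in Python; the names `kum_g` and `g` are my own, not from the paper:

```python
def kum_g(g, a, b):
    """Generalized Kumaraswamy distribution built on a base CDF g:
    K_G(x | a, b) = 1 - (1 - g(x)**a)**b.
    With a = b = 1 it reduces to the base distribution g itself."""
    def k(x):
        return 1.0 - (1.0 - g(x) ** a) ** b
    return k
```

For example, taking the base distribution to be the uniform CDF `g(x) = x` on [0, 1] recovers the standard (non-generalized) Kumaraswamy CDF.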
Choosing G(x) as the sigmoid σ(x) yields the Kumaraswamy unit:
K_σ(x|a,b) = 1 – (1 – σ(x)^a)^b.
When a = b = 1 the function reduces to the ordinary sigmoid. Increasing a and b makes the curve steeper around zero, approximating the behavior of a ReLU while still producing outputs confined to the interval (0, 1).
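A minimal sketch of the Kumaraswamy unit in Python follows; the function names are my own, and the shape-parameter values used below are illustrative assumptions, not values reported in the paper:

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def kumaraswamy_unit(x, a=1.0, b=1.0):
    """Kumaraswamy unit: K_sigma(x | a, b) = 1 - (1 - sigmoid(x)**a)**b.
    For a = b = 1 this is exactly the sigmoid; larger a, b sharpen the
    transition around zero while keeping outputs inside (0, 1)."""
    return 1.0 - (1.0 - sigmoid(x) ** a) ** b
```

Because the output stays in (0, 1), the unit remains bounded like the sigmoid, in contrast to the unbounded positive range of ReLU.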