Learning activation functions from data using cubic spline interpolation


Authors: Simone Scardapane, Michele Scarpiniti, Danilo Comminiello, Aurelio Uncini

Learning activation functions from data using cubic spline interpolation

Simone Scardapane, Michele Scarpiniti, Danilo Comminiello, and Aurelio Uncini
Department of Information Engineering, Electronics and Telecommunications (DIET), "Sapienza" University of Rome, Via Eudossiana 18, 00184, Rome.
Email: {simone.scardapane, michele.scarpiniti, danilo.comminiello}@uniroma1.it; aurel@ieee.org

Abstract. Neural networks require a careful design in order to perform properly on a given task. In particular, selecting a good activation function (possibly in a data-dependent fashion) is a crucial step, which remains an open problem in the research community. Despite a large amount of investigations, most current implementations simply select one fixed function from a small set of candidates, which is not adapted during training, and is shared among all neurons throughout the different layers. However, neither of these assumptions can be supposed optimal in practice. In this paper, we present a principled way to obtain data-dependent adaptation of the activation functions, which is performed independently for each neuron. This is achieved by leveraging past and present advances in cubic spline interpolation, allowing for local adaptation of the functions around their regions of use. The resulting algorithm is relatively cheap to implement, and overfitting is counterbalanced by the inclusion of a novel damping criterion, which penalizes unwanted oscillations from a predefined shape. Preliminary experimental results validate the proposal.

Keywords: Neural network, activation function, spline interpolation

1 Introduction

Neural networks (NNs) are extremely powerful tools for approximating complex nonlinear functions [7]. The nonlinear behavior is introduced in the NN architecture by the elementwise application of a given nonlinearity, called the activation function (AF), at every layer.
Since AFs are crucial to the dynamics and computational power of NNs, the history of the two over the last decades is deeply connected [15]. As an example, the use of differentiable AFs was one of the major breakthroughs in NNs, leading directly to the back-propagation algorithm. More recently, progress on piecewise linear functions was shown to facilitate the backward flow of information when training very deep networks [4]. At the same time, it is somewhat surprising that the vast majority of NNs use only a small handful of fixed functions, hand-chosen by the practitioner before the learning process. Worse, there is no principled reason to believe that a 'good' nonlinearity should be the same across all layers of the network, or even across neurons in the same layer.

This is shown clearly in a recent work by Agostinelli et al. [1], where every neuron in a deep network was endowed with an adaptable piecewise linear function with possibly different parameters, concluding that "the standard one-activation-function-fits-all approach may be suboptimal" in current practice. Experiments in AF adaptation have a long history, but they have never met wide applicability in the field. The simplest approach is to parameterize each sigmoid function in the network by one or more 'shape' parameters to be optimized, as in the seminal 1996 paper by Chen and Chang [3] or the later work by Trentin [16]. Along a similar line, one may consider the use of polynomial AFs, wherein each coefficient of the polynomial is adapted by gradient descent [11]. Additional investigations can be found in [20,5,2,10,9]. One strong drawback of these approaches is that the parameters involved affect the AF globally, so that a change in one region of the function may be counterproductive in a different, possibly faraway, region.
Several years ago, an alternative approach was introduced by using spline interpolating functions as AFs [17,6], resulting in what was called a spline AF (SAF). Splines are an attractive choice for interpolating unknown functions, since they can be described by a small number of parameters, yet each parameter has a local effect, and only a fixed number of them is involved every time an output value is computed [18]. The original works in [17,6] had two main drawbacks that prevented a wider use of the underlying theory. First, SAFs were only investigated in an online setting, where updates are computed one sample at a time; whether an efficient implementation is possible (and feasible) also for batch (or mini-batch) settings was not shown. Secondly, the obtained SAFs had a tendency to overfit the training data, resulting in oscillatory behaviors which hindered performance. Inspired by recent successes in the field of nonlinear adaptive filtering [13,14], our aim in this paper is two-fold. On one hand, we provide a modern introduction to the use of SAFs in neural networks, with a particular emphasis on their efficient implementation in the case of batch (or mini-batch) training. Our treatment clearly shows that the major problem in their implementation, evident from the discussion above, is the design of an efficient way to regularize their control points. In this sense, as a second contribution we provide a simple (yet effective) 'damping' criterion to prevent unwanted oscillations in the testing phase, which penalizes deviations from the original points in terms of the ℓ2 norm. A restricted set of experiments shows that the resulting formulation is able to achieve a lower test error than a standard NN with fixed AFs, while at the same time learning non-trivial activations with different shapes across different neurons. The rest of the paper is organized as follows.
Section 2 presents the basic theory of SAFs for the case of a single neuron. Section 3 extends the treatment to the case of a NN with one hidden layer, by deriving the gradient equations for the SAF parameters in the internal layer. Then, Section 4 goes over the experimental results, while we conclude with some final remarks in Section 5.

2 The spline activation function

We begin our treatment of SAFs with the simplest case of a single neuron endowed with a flexible AF (see [17,13] for additional details). Given a generic input x ∈ R^D, the output of the SAF is computed as:

    s = w^T x,        (1)
    y = φ(s; q),      (2)

where w ∈ R^D (we suppose that an eventual bias term is added directly to the input vector), and the AF φ(·) is parameterized by a vector q ∈ R^Q of internal parameters, called knots. The knots are a sampling of the AF values over Q representative points spanning the overall function. In particular, we suppose the knots to be uniformly spaced, i.e. q_{i+1} = q_i + Δx for a fixed Δx ∈ R, and symmetrically spaced around the origin. Given s, the output is computed by spline interpolation over the closest knot and its P rightmost neighbors. The common choice P = 3, which we adopt in this paper, corresponds to cubic interpolation, and is generally a good trade-off between locality of the output and interpolating precision. Given the index i of the closest knot, we can define the normalized abscissa value between q_i and q_{i+1} as:

    u = s/Δx − ⌊s/Δx⌋,    (3)

where ⌊·⌋ is the floor operator. From u we can compute the normalized reference vector u = [u^P, u^{P−1}, ..., u, 1]^T, while from i we can extract the relevant control points q_i = [q_i, q_{i+1}, ..., q_{i+P}]^T. We refer to the vector q_i as the i-th span. The output (2) is then computed as:

    y = φ(s) = u^T B q_i,    (4)

where B ∈ R^{(P+1)×(P+1)} is called the spline basis matrix.
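As a sketch, Eqs. (1)-(4) can be implemented for a single neuron as below. The Catmull-Rom basis B is the one the paper adopts; the mapping from the activation s to the span index i (leftmost knot of the active four-point window, with clipping at the boundary of the sampled range) is our own plausible convention, since the paper does not spell it out:

```python
import numpy as np

# Catmull-Rom basis for cubic (P = 3) spline interpolation, as used in the paper.
B = 0.5 * np.array([[-1.,  3., -3.,  1.],
                    [ 2., -5.,  4., -1.],
                    [-1.,  0.,  1.,  0.],
                    [ 0.,  2.,  0.,  0.]])

def saf_forward(x, w, q, dx):
    """Single-neuron SAF output, Eqs. (1)-(4).

    q holds Q uniformly spaced knots, symmetric around the origin, with
    spacing dx. Returns (y, s, i, u) so a backward pass can reuse the
    span index i and the normalized abscissa u. The indexing convention
    is an assumption of ours.
    """
    s = w @ x                              # Eq. (1): linear activation
    t = s / dx
    u = t - np.floor(t)                    # Eq. (3): normalized abscissa
    # Leftmost knot of the active window; knot (Q-1)//2 sits at the origin.
    i = int(np.floor(t)) + (len(q) - 1) // 2 - 1
    i = int(np.clip(i, 0, len(q) - 4))     # stay inside the sampled range
    u_vec = np.array([u**3, u**2, u, 1.0]) # [u^P, ..., u, 1]^T for P = 3
    y = u_vec @ B @ q[i:i + 4]             # Eq. (4)
    return y, s, i, u
```

With knots sampled from a hyperbolic tangent (the initialization discussed in Section 3.2), the interpolated output closely tracks tanh on the sampled range.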
In this work, we use the Catmull-Rom (CR) spline with P = 3, whose basis is given by:

    B = (1/2) [ −1   3  −3   1
                 2  −5   4  −1
                −1   0   1   0
                 0   2   0   0 ].    (5)

Different bases give rise to alternative interpolation schemes; e.g., a spline defined by a CR basis passes through all the control points, but its second derivative is not continuous. Apart from the locality of the output, SAFs have two additional interesting properties. First, the output in (4) is extremely efficient to compute, involving only vector-matrix products of very small dimensionality. Secondly, derivatives with respect to the internal parameters are equivalently simple and can be written down in closed form. In particular, the derivative of the nonlinearity φ(s) with respect to the input s is given by:

    ∂φ(s)/∂s = φ′(s) = ∂φ(s)/∂u · ∂u/∂s = (1/Δx) u̇^T B q_i,    (6)

where:

    u̇ = ∂u/∂u = [P u^{P−1}, (P−1) u^{P−2}, ..., 1, 0]^T.    (7)

Given this, the derivative of the SAF output y with respect to w is straightforward:

    ∂φ(s)/∂w = φ′(s) · ∂s/∂w = φ′(s) x.    (8)

Similarly, for q_i we obtain:

    ∂φ(s)/∂q_i = B^T u,    (9)

while ∂φ(s)/∂q_k = 0 for any element q_k outside the current span q_i.

3 Designing networks with SAF neurons

3.1 Computing outputs and inner derivatives

Now we consider the more elaborate case of a single-hidden-layer NN, with a D-dimensional input, H neurons in the hidden layer, and O output neurons.[1] Every neuron in the network uses a SAF with possibly different adaptive control points, which are set independently during the training process. For ease of computation, we suppose that the sampling set of the splines is the same for every neuron (i.e., each neuron has Q points equispaced according to the same Δx), and we also have a single shared basis matrix B. The forward phase of the network is similar to that of a standard network.
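The closed-form derivatives (6)-(9) can be sketched directly. The function below assumes the normalized abscissa u and the four active control points are available from the forward pass; the function and argument names are ours:

```python
import numpy as np

# Catmull-Rom basis, Eq. (5).
B = 0.5 * np.array([[-1.,  3., -3.,  1.],
                    [ 2., -5.,  4., -1.],
                    [-1.,  0.,  1.,  0.],
                    [ 0.,  2.,  0.,  0.]])

def saf_gradients(x, u, q_span, dx):
    """Gradients of one SAF output for one sample, Eqs. (6)-(9).

    u is the normalized abscissa from Eq. (3), q_span the four active
    control points. Control points outside the span have zero gradient.
    """
    u_vec = np.array([u**3, u**2, u, 1.0])
    du = np.array([3 * u**2, 2 * u, 1.0, 0.0])   # Eq. (7): derivative of u_vec
    dphi_ds = (du @ B @ q_span) / dx             # Eq. (6)
    grad_w = dphi_ds * x                         # Eq. (8)
    grad_q_span = B.T @ u_vec                    # Eq. (9)
    return dphi_ds, grad_w, grad_q_span
```

A quick finite-difference check against the forward formula (4) confirms the derivative in (6).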
In particular, given the input x, we first compute the output of the i-th hidden neuron, i = 1, ..., H, as:

    h_i = φ(w_{h,i}^T x; q_{h,i}).    (10)

These are concatenated in a single vector h = [h_1, ..., h_H, 1]^T, and the i-th output of the network, i = 1, ..., O, is given by:

    f_i(x) = y_i = φ(w_{y,i}^T h; q_{y,i}).    (11)

The derivatives with respect to the parameters {w_{y,i}, q_{y,i}}, i = 1, ..., O, can be computed directly with (8)-(9), substituting x with h. By back-propagation, the derivative of the j-th output with respect to the i-th (inner) weight vector w_{h,i} is similar to that of a standard NN:

    ∂y_j/∂w_{h,i} = φ′(s_{y,j}) · φ′(s_{h,i}) · ⌊w_{y,j}⌋_i · x,    (12)

where with a slight abuse of notation we let s_{y,j} denote the activation of the j-th output (and similarly s_{h,i} that of the i-th hidden neuron), ⌊·⌋_i extracts the i-th element of its input vector, and the two φ′(·) are given by (6). For the derivative with respect to the control points of the i-th hidden neuron, denote by q_{h,i,k} the currently active span, and by u_{h,i} the corresponding reference vector. The derivative of the j-th output is then given by:

    ∂y_j/∂q_{h,i,k} = φ′(s_{y,j}) · ⌊w_{y,j}⌋_i · B^T u_{h,i}.    (13)

[1] We note that the following treatment can be extended easily to the case of a network with more than one hidden layer. However, restricting it to a single layer allows us to keep the discussion focused on the problems/advantages arising in the use of SAFs. We leave this extension to a future work.

3.2 Initialization of the control points

An important aspect that we have not discussed yet is how to properly initialize the control points. One immediate choice is to sample their values from an AF which is known to work well on the given problem, e.g. a hyperbolic tangent. In this way, the network is guaranteed to work similarly to a standard NN in the initial phase of learning.
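The tanh-based initialization just described, including the optional small Gaussian perturbation of a random subset of knots that the paper also uses, can be sketched as follows (parameter names and defaults are ours; the paper's experiments use Q knots on [−2, +2] with Δx = 0.2 and perturb about 5% of the points):

```python
import numpy as np

def init_control_points(Q=21, dx=0.2, noise_std=0.0, noise_frac=0.05, rng=None):
    """Initialize one neuron's knots by sampling a tanh nonlinearity.

    Knots are uniformly spaced with step dx and symmetric around the
    origin. If noise_std > 0, a small Gaussian perturbation is added to
    a random subset of knots (about 5% by default).
    """
    rng = np.random.default_rng() if rng is None else rng
    positions = (np.arange(Q) - (Q - 1) // 2) * dx   # symmetric sampling grid
    q = np.tanh(positions)
    if noise_std > 0:
        mask = rng.random(Q) < noise_frac            # random ~5% subset
        q[mask] += rng.normal(0.0, noise_std, size=int(mask.sum()))
    return q
```

With the defaults, the network starts out behaving like a standard tanh network, as the text requires.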
Additionally, we have found good improvements in error by adding Gaussian noise N(0, σ²) with small variance σ² to a randomly chosen subset of control points (around 5% in our experiments). This provides a good variability in the beginning, similarly to how connections are set close to (but not identically equal to) zero during initialization.

3.3 Choosing a training criterion

Suppose we are provided with a training set of N input/output pairs in the form {x_i, d_i}, i = 1, ..., N. For simplicity of notation, we denote by w the concatenation of all weight vectors {w_{h,i}} and {w_{y,i}}, and by q a similar concatenation of all control points. Training can be formulated as the minimization of the following cost function:

    J(w, q) = (1/N) Σ_{i=1}^{N} L(d_i, f(x_i)) + λ_w R_w(w) + λ_q R_q(q),    (14)

where L(·,·) is an error function, while R_w(·) and R_q(·) provide meaningful regularization on the two sets of parameters. The first two terms are well-known in the neural network literature [7], and they can be set accordingly. In particular, in our experiments we consider a squared error term L(d_i, f(x_i)) = ||d_i − f(x_i)||²₂ and ℓ2 regularization on the weights, R_w(w) = ||w||²₂. The derivatives of L(·,·) can be computed straightforwardly with the formulas presented in Section 3.1. The term R_q(q) is used to avoid overfitted solutions for the control points. In fact, its presence is the major difference with respect to previous attempts at implementing SAFs in neural networks [17], wherein overfitting was counterbalanced by choosing a large value for Δx, which in a way goes against the philosophy of spline interpolation itself. At the same time, choosing a proper form for the regularization term is non-trivial, as the term should be cheap to compute, and it should introduce just as much a priori information as needed, without hindering the training process.
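With the squared error and ℓ2 weight penalty just described, and the damping criterion the paper proposes for R_q (an ℓ2 penalty on deviations from the initial control points), the cost (14) can be sketched as below. The `forward` callback and all names are our assumptions, not the paper's interface:

```python
import numpy as np

def cost(w, q, q0, X, D, forward, lam_w=1e-3, lam_q=1e-5):
    """Training cost of Eq. (14): mean squared error, l2 weight decay,
    and the damping regularizer on the control points.

    forward(w, q, x) is assumed to return the network output f(x);
    q0 holds the (noiseless) initial control points.
    """
    err = np.mean([np.sum((d - forward(w, q, x))**2) for x, d in zip(X, D)])
    return err + lam_w * np.sum(w**2) + lam_q * np.sum((q - q0)**2)
```

Note that when q = q0 the damping term vanishes, so an untrained SAF network pays no extra penalty relative to its tanh initialization.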
Most of the literature on regularizing w cannot be used here, as the corresponding formulations do not make sense in the context of spline interpolation. As an example, simply penalizing the ℓ2 norm of q leads to functions close to the zero function, while imposing sparsity is also meaningless. For the purpose of this paper, we consider the following 'damping' criterion:

    R_q(q) = ||q − q_0||²₂,    (15)

where q_0 represents the initial values of the control points, as discussed in the previous section (without considering the additional noise). The criterion makes intuitive sense as follows: while for w we wish to penalize unwanted deviations from very small weights (which can be justified with arguments from learning theory), in the case of q we are interested in penalizing changes with respect to a 'good' function parameterized by the initial control points q_0, namely one of the standard AFs used in NN training. In fact, setting a very high value for λ_q essentially deactivates the adaptation of the control points. Clearly, other choices are possible, and in this sense this paper serves as a starting point for further investigations towards this objective. As an example, we may wish to penalize the first (or second) order derivatives of the splines in order to force a desired level of smoothness [18].

3.4 Remarks on the implementation

In order to be usable in practice, SAFs require an efficient implementation to compute outputs and derivatives concurrently for the entire training dataset or, alternatively, for a properly chosen mini-batch (in the case of stochastic optimization algorithms). To begin with, we underline that the equations for the reference vector (see (3)) do not depend on the specific neuron, and for this reason they can easily be vectorized layer-wise on most numerical algebra libraries to obtain all vectors concurrently.
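The layer-wise vectorization of Eq. (3) can be sketched as follows: span indices and reference vectors for an entire (mini-batch × layer) matrix of activations are obtained without any per-neuron loop. The indexing convention (leftmost knot of the active four-point window, clipped at the boundary) is our assumption:

```python
import numpy as np

def spans_and_reference(S, dx, Q):
    """Vectorized Eq. (3) for a whole layer and mini-batch.

    S is an (N, H) matrix of activations s for N samples and H neurons.
    Returns the span start indices (N, H) and the reference vectors
    (N, H, 4) in one shot.
    """
    T = S / dx
    U = T - np.floor(T)                              # fractional abscissas
    I = np.floor(T).astype(int) + (Q - 1) // 2 - 1   # window start per entry
    I = np.clip(I, 0, Q - 4)                         # stay inside the knot range
    U_ref = np.stack([U**3, U**2, U, np.ones_like(U)], axis=-1)
    return I, U_ref
```

The indices I and the products B q_i they select are exactly the quantities worth caching during the forward pass, as discussed next.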
Additionally, the indexes and the relative terms B q_i in (4) can be cached during the forward pass, to be reused during the computation of the derivatives. In this sense, the outputs of a layer and its derivatives can be computed by one 4×4 matrix-vector computation and three 4-dimensional inner products, which have to be repeated for every input/neuron pair. In our experience, the cost of a relatively well-optimized implementation does not exceed twice that of a standard network for medium-sized batches, where the most onerous operation is the reshaping of the gradients in (9) and (13) into a single vector of gradients relative to the global vector q.

4 Experimental results

4.1 Experimental setup

To evaluate the preliminary proposal, we consider two simple regression benchmarks for neural networks: the 'chemical' dataset (included among MATLAB's testbeds for function fitting) and the 'California Housing' dataset.[2] They have respectively 498 and 20640 examples, and 8 numerical features each. Inputs are normalized in the [−1, +1] range, while outputs are normalized in the [−0.5, +0.5] range. We compare a NN with 5 hidden neurons and tanh(·) AFs (denoted as 'Standard' in the results), and a NN with the same number of neurons and SAF nonlinearities. The weight vector w is initialized with the method described in [4]. Each SAF is initialized from a tanh(·) nonlinearity, and control points are defined in the [−2, +2] range with Δx = 0.2, which is a good compromise between locality of the SAFs and the overall number of adaptable parameters. For the first scenario, λ_q is kept to a small value of 10⁻⁵. For each experiment, a random 30% of the dataset is kept for testing, and results are averaged over 15 different splits to average out statistical effects. Error is computed with the Normalized Root Mean-Squared Error (NRMSE).

[2] http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
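For concreteness, the error measure can be sketched as below. The paper does not spell out its normalizer; normalizing the RMSE by the standard deviation of the targets is one common convention, and it is consistent with the observation in Section 4.2 that a constant predictor scores an error of 1:

```python
import numpy as np

def nrmse(d, y):
    """Normalized root mean-squared error between targets d and
    predictions y. The choice of normalizer (target standard deviation)
    is our assumption.
    """
    return np.sqrt(np.mean((d - y)**2)) / np.std(d)
```

Under this convention a perfect predictor scores 0, while predicting the target mean (e.g. the constant zero output of a strongly underfitted tanh network on zero-mean targets) scores exactly 1.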
The optimization problems are solved using a freely available MATLAB implementation of the Polack-Ribiere variant of the nonlinear conjugate gradient optimization algorithm by C.E. Rasmussen [12].[3] The optimization process is allowed a maximum of 1500 iterations. MATLAB code for the experiments is also available on the web.[4] We briefly remark that the MATLAB library, apart from repeating the experiments presented here, is also designed to handle networks with more than a single hidden layer, and implements the ADAM algorithm [8] for stochastic training in the case of a larger dataset.

4.2 Scenario 1: strong underfitting

As a first example, we consider a scenario of strong underfitting, wherein the user has misleadingly selected a very large value of λ_w = 1, leading in turn to extremely small values for the elements of w after training. Results in terms of training and test NRMSE are provided in Table 1.

Table 1. Average results for scenario 1 (λ_w = 1), together with one standard deviation.

    Dataset     Nonlinearity   Train NRMSE    Test NRMSE
    Chemical    Standard       1.00 ± 0.00    1.00 ± 0.01
                SAF            0.29 ± 0.02    0.31 ± 0.02
    Calhousing  Standard       1.02 ± 0.00    1.01 ± 0.01
                SAF            0.56 ± 0.01    0.57 ± 0.02

Since the activations of the NN tend to be very close to 0 (where the hyperbolic tangent operates in an almost-linear regime), standard NNs have a constant zero output, leading to a NRMSE of 1. Nonetheless, SAF networks are able to reach a very satisfactory level of performance, which in the first case is almost comparable to that of a fully optimized network (see the following section). To show the reasons for this, we have plotted four representative nonlinearities after training in Fig. 1.
It is easy to see that the nonlinearities have adapted to act as 'amplifiers' for the activations in their operating regime, with mild and strong peaks around 0. Of particular interest is the fact that the resulting SAFs need not be perfectly centered around 0 (e.g. Fig. 1c), or even symmetrical around the y-axis (e.g. Fig. 1d). In fact, the splines are able to efficiently counterbalance a bad setting of the weights, with behaviors which would be very hard (or close to impossible) to obtain using standard setups with fixed, shared, mild nonlinearities.

Fig. 1. Non-trivial representative SAFs after training for scenario 1 (four panels; spline value versus activation s).

[3] http://learning.eng.cam.ac.uk/carl/code/minimize/
[4] [The URL has been hidden for the review process.]

4.3 Scenario 2: well-optimized parameters

In our second scenario, we consider a similar comparison, but we fine-tune the parameters of the two methods using a grid search, with a 3-fold cross-validation on the training data as performance measure. Both λ_w and λ_q (the latter only for the proposed algorithm) are searched in an exponential interval 2^j, with j = −10, ..., 5.

Table 2. Optimal parameters (averaged over the runs) found by the grid-search procedure for scenario 2.

    Dataset     Nonlinearity   λ_w     λ_q
    Chemical    Standard       10⁻³    —
                SAF            10⁻²    10⁻⁴
    Calhousing  Standard       10⁻⁴    —
                SAF            10⁻³    10⁻⁴

Optimal parameters found by the grid search are listed in Table 2, while results in terms of training and test NRMSE are collected in Table 3. Overall, we see that the NNs endowed with the SAF nonlinearities are able to surpass by a large margin both a standard NN and the results from the previous scenario.
The only minor drawback evidenced in Table 3 is that the SAF network shows some overfitting on the 'chemical' dataset (around 2 points of NRMSE), showing that there is still some room for improvement in terms of the optimal regularization of the splines.

Table 3. Average results for scenario 2 (fine-tuned parameters), together with one standard deviation.

    Dataset     Nonlinearity   Train NRMSE    Test NRMSE
    Chemical    Standard       0.32 ± 0.01    0.32 ± 0.02
                SAF            0.26 ± 0.01    0.28 ± 0.02
    Calhousing  Standard       0.55 ± 0.01    0.55 ± 0.01
                SAF            0.51 ± 0.02    0.51 ± 0.02

Also in this case, we plot some representative SAFs after training (taken among those which are not trivially identical to the tanh nonlinearity) in Fig. 2. As before, SAFs generally tend to provide an amplification (with a possible change of sign) of their activation around some region of operation. It is interesting to observe that, also in this case, the optimal shape need not be symmetric (e.g. Fig. 2a), and might even be far from centered around 0 (e.g. Fig. 2c). The resulting nonlinearities can also present some additional non-trivial behaviors, such as a small region of insensitivity around 0 (e.g. Fig. 2d), or a region of pre-saturation before the actual tanh saturation (e.g. Figs. 2e-2f).

Fig. 2. Non-trivial representative SAFs after training for scenario 2 (six panels; spline value versus activation s).

5 Conclusion

In this paper, we have presented a principled way to adapt the activation functions of a neural network from the training data, locally and independently for each neuron. In particular, each nonlinearity is implemented with cubic spline interpolation, whose control points are adapted in the optimization phase.
Overfitting is controlled by a novel ℓ2 regularization criterion avoiding unwanted oscillations. Albeit efficient, this criterion does constrain the shapes of the resulting functions to a certain degree; in this sense, the design of more advanced regularization terms is a promising line of research. Additionally, we plan on exploring the application of SAFs to deeper networks, where it is expected that the statistics of the neurons' activations can change significantly layer-wise [4].

References

1. Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P.: Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830 (2014)
2. Chandra, P., Singh, Y.: An activation function adapting training algorithm for sigmoidal feedforward networks. Neurocomputing 61, 429–437 (2004)
3. Chen, C.T., Chang, W.D.: A feedforward neural network with function shape autotuning. Neural Networks 9(4), 627–641 (1996)
4. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Int. Conf. on Artificial Intell. and Stat., pp. 249–256 (2010)
5. Goh, S., Mandic, D.: Recurrent neural networks with trainable amplitude of activation functions. Neural Networks 16(8), 1095–1100 (2003)
6. Guarnieri, S., Piazza, F., Uncini, A.: Multilayer feedforward networks with adaptive spline activation function. IEEE Trans. Neural Netw. 10(3), 672–683 (1999)
7. Haykin, S.: Neural Networks and Learning Machines. Pearson Education, 3rd edn. (2009)
8. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference for Learning Representations (2015), arXiv preprint arXiv:1412.6980
9. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
10. Ma, L., Khorasani, K.: Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Trans. Neural Netw. 16(4), 821–833 (2005)
11. Piazza, F., Uncini, A., Zenobi, M.: Artificial neural networks with adaptive polynomial activation function. In: Int. Joint Conf. on Neural Networks, vol. 2, pp. II–343. IEEE/INNS (1992)
12. Rasmussen, C.: Gaussian Processes for Machine Learning. MIT Press (2006)
13. Scarpiniti, M., Comminiello, D., Parisi, R., Uncini, A.: Nonlinear spline adaptive filtering. Signal Process. 93(4), 772–783 (2013)
14. Scarpiniti, M., Comminiello, D., Scarano, G., Parisi, R., Uncini, A.: Steady-state performance of spline adaptive filters. IEEE Trans. Signal Process. 64(4), 816–828 (2016)
15. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015)
16. Trentin, E.: Networks with trainable amplitude of activation functions. Neural Networks 14(4), 471–493 (2001)
17. Vecci, L., Piazza, F., Uncini, A.: Learning and approximation capabilities of adaptive spline activation function neural networks. Neural Networks 11(2), 259–270 (1998)
18. Wahba, G.: Spline Models for Observational Data. SIAM (1990)
19. Wang, Y., Shen, D., Teoh, E.: Lane detection using Catmull-Rom spline. In: IEEE Int. Conf. on Intelligent Vehicles, pp. 51–57 (1998)
20. Zhang, M., Xu, S., Fulcher, J.: Neuron-adaptive higher order neural-network models for automated financial data modeling. IEEE Trans. Neural Netw. 13(1), 188–204 (2002)
