Additive function approximation in the brain
Kameron Decker Harris
Paul G. Allen School of Computer Science and Engineering, Department of Biology
University of Washington
kamdh@uw.edu

Abstract

Many biological learning systems such as the mushroom body, hippocampus, and cerebellum are built from sparsely connected networks of neurons. For a new understanding of such networks, we study the function spaces induced by sparse random features and characterize what functions may and may not be learned. A network with d inputs per neuron is found to be equivalent to an additive model of order d, whereas with a degree distribution the network combines additive terms of different orders. We identify three specific advantages of sparsity: additive function approximation is a powerful inductive bias that limits the curse of dimensionality, sparse networks are stable to outlier noise in the inputs, and sparse random features are scalable. Thus, even simple brain architectures can be powerful function approximators. Finally, we hope that this work helps popularize kernel theories of networks among computational neuroscientists.

1 Introduction

Kernel function spaces are popular among machine learning researchers as a potentially tractable framework for understanding artificial neural networks trained via gradient descent [e.g. 1, 2, 3, 4, 5, 6]. Artificial neural networks are an area of intense interest due to their often surprising empirical performance on a number of challenging problems and our still incomplete theoretical understanding. Yet computational neuroscientists have not widely applied these new theoretical tools to describe the ability of biological networks to perform function approximation. The idea of using fixed random weights in a neural network is primordial, and was a part of Rosenblatt's perceptron model of the retina [7].
Random features have since resurfaced under many guises: random centers in radial basis function networks [8], functional link networks [9], Gaussian processes (GPs) [10, 11], and so-called extreme learning machines [12]; see [13] for a review. Random feature networks, where the neurons are initialized with random weights and only the readout layer is trained, were proposed by Rahimi and Recht in order to improve the performance of kernel methods [14, 15] and can perform well for many problems [13]. In parallel to these developments in machine learning, computational neuroscientists have also studied the properties of random networks with a goal towards understanding neurons in real brains. To a first approximation, many neuronal circuits seem to be randomly organized [16, 17, 18, 19, 20]. However, the recent theory of random features appears to be mostly unknown to the greater computational neuroscience community. Here, we study random feature networks with sparse connectivity: the hidden neurons each receive input from a random, sparse subset of input neurons. This is inspired by the observation that the connectivity in a variety of predominantly feedforward brain networks is approximately random and sparse. These brain areas include the cerebellar cortex, invertebrate mushroom body, and dentate gyrus of the hippocampus [21]. All of these areas perform pattern separation and associative learning. The cerebellum is important for motor control, while the mushroom body and dentate gyrus are

Preprint. Under review.

Figure 1: Sparse connectivity in a shallow neural network. A network with l inputs, m features, and one output, where each hidden neuron has in-degree d. Advantages of sparsity: additivity as inductive bias, stability to input noise, scalable wiring and computation. Example at right: learning an additive function from limited data using sparse random features (true function vs. estimate).
The function shown is the sparse random feature approximation to an additive sum of sines, learned from poorly distributed samples (red crosses). Additivity offers structure which may be leveraged for fast and efficient learning.

general learning and memory areas for invertebrates and vertebrates, respectively, and may have evolved from a similar structure in the ancient bilaterian ancestor [22]. Recent work has argued that the sparsity observed in these areas may be optimized to balance the dimensionality of representation with wiring cost [20]. Sparse connectivity has been used to compress neural networks and speed up computation [23, 24, 25], whereas convolutions are a kind of structured sparsity [26, 27]. We show that sparse random features approximate additive kernels [28, 29, 30, 31] with arbitrary orders of interaction. The in-degree d of the hidden neurons sets the order of interaction. When the degrees of the neurons are drawn from a distribution, the resulting kernel contains a weighted mixture of interactions. These sparse features offer advantages of generalization in high dimensions, stability under perturbations of their input, and computational and biological efficiency.

2 Background: Random features and kernels

We now introduce the mathematical setting and review how random features give rise to kernels. The simplest artificial neural network contains a single hidden layer of size m, receiving input from a layer of size l (Figure 1). The activity in the hidden layer is given by, for i ∈ [m],

φ_i(x) = h(w_i^⊤ x + b_i).  (1)

Here each φ_i is a feature in the hidden layer, h is the nonlinearity, W = (w_1, w_2, ..., w_m) ∈ R^{l×m} are the input-to-hidden weights, and b ∈ R^m are their biases. We can write this in vector notation as φ(x) = h(W^⊤ x + b), where φ: R^l → R^m. Random feature networks draw their input-hidden layer weights at random.
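Equation (1) is just an elementwise nonlinearity applied to an affine map. As a minimal sketch (our own code, with arbitrary sizes and tanh standing in for a generic nonlinearity), the vectorized form matches the per-feature definition:

```python
import numpy as np

rng = np.random.default_rng(8)
l, m = 4, 7
W = rng.normal(size=(l, m))   # columns are the weight vectors w_i
b = rng.normal(size=m)
h = np.tanh                   # any scalar nonlinearity, applied elementwise

x = rng.normal(size=l)
phi_vec = h(W.T @ x + b)                                        # phi: R^l -> R^m
phi_loop = np.array([h(W[:, i] @ x + b[i]) for i in range(m)])  # equation (1), feature by feature
```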
Let the weights W and biases b in the feature expansion (1) be sampled i.i.d. from a distribution µ on R^{l+1}. Under mild assumptions, the inner product of the feature vectors for two inputs converges to its expectation as m → ∞:

(1/m) φ(x)^⊤ φ(x′) → E[φ(x) φ(x′)] = ∫ h(w^⊤ x + b) h(w^⊤ x′ + b) dµ(w, b) =: k(x, x′).  (2)

We identify the limit (2) with a reproducing kernel k(x, x′) induced by the random features, since the limiting function is an inner product and thus always positive semidefinite [14]. The kernel defines an associated reproducing kernel Hilbert space (RKHS) of functions. For a finite network of width m, the inner product (1/m) φ(x)^⊤ φ(x′) is a randomized approximation to the kernel k(x, x′).

3 Sparsely connected random feature kernels

We now turn to our main result: the general form of the random feature kernels with sparse, independent weights. For simplicity, we start with a regular model and then generalize the result to networks with varying in-degree. Two kernels that can be computed in closed form are highlighted. Fix an in-degree d, where 1 ≤ d ≤ l, and let µ|_d be a distribution on R^d which induces, together with some nonlinearity h, the kernel k_d(z, z′) on z, z′ ∈ R^d (for the moment, d is not random). Sample a sparse feature i ∈ [m] in two steps: first, pick a set of d neighbors uniformly at random from all (l choose d) possibilities, and let N_i ⊆ [l] denote this set of neighbors. Second, sample w_{ji} ∼ µ|_d if j ∈ N_i and otherwise set w_{ji} = 0. We find that the resulting kernel is

k_d^reg(x, x′) = E[E[φ(x_N) φ(x′_N) | N]] = (l choose d)^{-1} Σ_{N : |N| = d} k_d(x_N, x′_N).  (3)

Here x_N denotes the length-d vector of x restricted to the neighborhood N, with the other l − d entries of x ignored. More generally, the in-degrees may be chosen independently according to a degree distribution, so that d becomes a random variable.
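The two-step sampling scheme can be simulated directly; averaging over features simultaneously averages over neighborhoods, which is how the empirical kernel approaches the neighborhood average in (3). A minimal sketch (our own code, not the paper's implementation), using paired (cos, sin) Fourier features so that the self-kernel is exactly 1:

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, m = 10, 3, 20000  # inputs, in-degree, hidden features

# Step 1: each hidden feature reads a uniformly random d-subset of the l inputs.
neighborhoods = np.stack([rng.choice(l, size=d, replace=False) for _ in range(m)])
# Step 2: Gaussian weights on the chosen inputs only; all other weights are zero.
W = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

def phi(x):
    """Paired (cos, sin) Fourier features of the sparse preactivations."""
    pre = np.einsum("md,md->m", W, x[neighborhoods]) + b
    return np.concatenate([np.cos(pre), np.sin(pre)])

x, xp = rng.normal(size=l), rng.normal(size=l)
k_est = phi(x) @ phi(xp) / m   # Monte Carlo estimate of k_d^reg(x, x')
k_self = phi(x) @ phi(x) / m   # equals 1 exactly, since cos^2 + sin^2 = 1
```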
Let D(d) be the probability mass function of the hidden node in-degrees. Conditional on node i having degree d_i, the in-neighborhood N_i is chosen uniformly at random among the (l choose d_i) possible sets. Then the induced kernel becomes

k_D^dist(x, x′) = E[E[φ(x_N) φ(x′_N) | N, d]] = Σ_{d=0}^{l} D(d) k_d^reg(x, x′).  (4)

For example, if every layer-two node chooses each of its inputs independently with probability p, then D(d) is the probability mass function of the binomial distribution Bin(l, p). The regular model (3) is a special case of (4) with D(d′) = 1{d′ = d}. Extending the proof techniques in [14] yields:

Claim  The random map (1/m) φ(x)^⊤ φ(x′) with κ-Lipschitz nonlinearity uniformly approximates k_D^dist(x, x′) to error ε using m = Ω((lκ²/ε²) log(C/ε)) many features (the proof is contained in Appendix C).

Two simple examples  With Gaussian weights and regular d = 1, we find that (see Appendix B)

k_1^reg(x, x′) = 1 − (1/l) ‖sgn(x) − sgn(x′)‖_0  if h = step function, and  (5)
k_1^reg(x, x′) = 1 − (c/l) ‖x − x′‖_1  if h = sign function.  (6)

4 Advantages of sparse connectivity

4.1 Additive modeling

The regular degree kernel (3) is a sum of kernels that each depend on a combination of only d inputs, making it an additive kernel of order d. The general expression for the degree distribution kernel (4) illustrates that sparsity leads to a mixture of additive kernels of different orders. These have been referred to as additive GPs [30], but these kinds of models have a long history as generalized additive models [e.g. 28, 32]. For the regular degree model with d = 1, the sum in (3) is over neighborhoods of size one, simply the individual indices of the input space. Thus, for any two distinct input neighborhoods N_1 and N_2, we have N_1 ∩ N_2 = ∅, and the RKHS corresponding to k_1^reg(x, x′) is the direct sum of the subspaces H = H_1 ⊕ ... ⊕ H_l.
Thus regular d = 1 defines a first-order additive model, where f(x) = f_1(x_1) + ... + f_l(x_l). When d > 1 we allow interactions between subsets of d variables; e.g. regular d = 2 leads to f(x) = f_{12}(x_1, x_2) + ... + f_{l−1,l}(x_{l−1}, x_l), all pairwise terms. These interactions are defined by the structure of the terms k_d(x_N, x′_N). Finally, the degree distribution D(d) determines how much weight to place on different degrees of interaction.

Generalization from fewer examples in high dimensions  Stone proved that first-order additive models do not suffer from the curse of dimensionality [33, 34], as the excess risk does not depend on the dimension l. Kandasamy and Yu [31] extended this result to d-th-order additive models and found a bound on the excess risk of O(l^{2d} n^{−2s/(2s+d)}) or O(l^{2d} C^d / n) for kernels with polynomial or exponential eigenvalue decay rates (n is the number of samples, and the constants s and C parametrize the rates). Without additivity, these weaken to O(n^{−2s/(2s+l)}) and O(C^l / n), much worse when l ≫ d.

Similarity to dropout  Dropout regularization [35, 36] in deep networks has been analyzed in a kernel/GP framework [37], leading to (4) with D = Bin(l, p) for a particular base kernel. Dropout may thus improve generalization by enforcing approximate additivity, for the reasons above.

4.2 Stability: robustness to noise or attacks affecting a few inputs

Equations (5) and (6) are similar: they differ only by the presence of an ℓ_0-"norm" versus an ℓ_1-norm and the presence of the sign function. Both norms are stable to outlying coordinates in an input x. This property also holds for 1 < d ≪ l, since every feature φ_i(x) only depends on d inputs, and therefore only a minority of the m features will be affected by the few outliers.¹ Thus, sufficiently sparse features will be less affected by sparse noise than a fully-connected network.
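The "minority of features" argument is easy to check by simulation. In this sketch (our own code; the corrupted coordinate index is arbitrary), a single noisy input coordinate reaches only a fraction of hidden features concentrating near d/l:

```python
import numpy as np

rng = np.random.default_rng(7)
l, d, m = 100, 5, 20000

# Regular in-degree d: each feature reads a uniform random d-subset of inputs.
neighborhoods = np.stack([rng.choice(l, size=d, replace=False) for _ in range(m)])

noisy_coord = 3   # suppose a single input coordinate is corrupted
affected = (neighborhoods == noisy_coord).any(axis=1)
frac = affected.mean()   # each feature is hit with probability exactly d / l
```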
Furthermore, any regressor f(x) = α^⊤ φ(x) built from these features will also be stable. By Cauchy-Schwarz, |f(x) − f(x′)| ≤ ‖α‖_2 ‖φ(x) − φ(x′)‖_2. Thus if x′ = x + e, where e is noise with a small number of nonzero entries, then f(x′) ≈ f(x) since φ(x) ≈ φ(x′). The network intrinsically denoises its inputs, which may offer advantages [e.g. 20]. Stability also may guarantee the robustness of networks to adversarial attacks [38, 39, 40]; thus sparse networks are robust to attacks on only a few inputs.

4.3 Scalability: computational and biological

Computational  Sparse random features give potentially huge improvements in scaling. Direct implementations of additive models incur a large cost for d > 1, since (3) requires a sum over (l choose d) = O(l^d) neighborhoods.² This leads to O(n² l^d) time to compute the Gram matrix of n examples and O(n l^d) operations to evaluate f(x). In our case, since the random features method is primal, we need to perform O(nmd) computations to evaluate the feature matrix, and the cost of evaluating f(x) remains O(md).³ Sparse matrix-vector multiplication makes evaluation faster than the O(ml) time it takes when connectivity is dense. For ridge regression, we have the usual advantages that computing an estimator takes O(nm² + nmd) time and O(nm + md) memory, rather than O(n³) time and O(n²) memory for a naïve kernel ridge method.

Biological  In a small animal such as a flying insect, space is extremely limited. Sparsity offers a huge advantage in terms of wiring cost [20]. Additive approximation also means that such animals can learn much more quickly, as seen in the mushroom body [41, 42, 43]. While the previous computational points do not apply as well to biology, since real neurons operate in parallel, fewer operations translate into lower metabolic cost for the animal.
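The O(md) evaluation cost amounts to a sparse matrix-vector product. The following sketch (ours, with arbitrary sizes) stores only the d weights and indices per feature, using m·d rather than m·l multiply-adds, and checks the result against the equivalent dense computation:

```python
import numpy as np

rng = np.random.default_rng(3)
l, d, m = 1000, 10, 5000

# Each hidden unit stores only its d weights and the input indices it reads.
neighborhoods = np.stack([rng.choice(l, size=d, replace=False) for _ in range(m)])
W_vals = rng.normal(size=(m, d))
b = rng.normal(size=m)
x = rng.normal(size=l)

# Sparse evaluation: m*d = 50,000 multiply-adds.
pre_sparse = np.einsum("md,md->m", W_vals, x[neighborhoods]) + b

# Equivalent dense computation for reference: m*l = 5,000,000 multiply-adds.
W_dense = np.zeros((m, l))
W_dense[np.repeat(np.arange(m), d), neighborhoods.ravel()] = W_vals.ravel()
pre_dense = W_dense @ x + b
```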
5 Discussion

Inspired by their ubiquity in biology, we have studied sparse random networks of neurons using the theory of random features, finding the advantages of additivity, stability, and scalability. This theory shows that sparse networks such as those found in the mushroom body, cerebellum, and hippocampus can be powerful function approximators. Kernel theories of neural circuits may be more broadly applicable in the field of computational neuroscience.

Expanding the theory of dimensionality in neuroscience  Learning is easier in additive function spaces because they are low-dimensional, a possible explanation for few-shot learning in biological systems. Our theory is complementary to existing theories of dimensionality in neural systems [16, 44, 45, 46, 47, 20, 48, 49, 50], which defined dimensionality using a (debatably) ad hoc skewness measure of covariance eigenvalues. Kernel theory extends this concept, measuring dimensionality similarly [51] in the space of nonlinear functions spanned by the kernel.

Limitations  We model biological neurons as simple scalar functions, completely ignoring time and neuromodulatory context. It seems possible that a kernel theory could be developed for time- and context-dependent features. Our networks suppose i.i.d. weights, but weights that follow Dale's law should also be considered. We have not studied the sparsity of activity, postulated to be relevant in the cerebellum. It remains to be demonstrated how the theory can make concrete, testable predictions, e.g. whether this theory may explain identity versus concentration encoding of odors or the discrimination/generalization tradeoff under experimental conditions.

¹ If one coordinate of x is noisy, the probability that the i-th neuron is affected is d_i/l ≪ 1.
² There is a more efficient method when working with a tensor product kernel, as in [29, 30, 31].
³ Note that we need to take m = Ω(l) to ensure good approximation of the kernel (Appendix C).

Acknowledgments

KDH was supported by a Washington Research Foundation postdoctoral fellowship. Thank you to Rajesh Rao for support during this project and to Bing Brunton for support and many helpful comments.

References

[1] Francis Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
[2] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv:1806.07572 [cs, math, stat], June 2018.
[3] Lenaic Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. arXiv:1805.09545 [cs, math, stat], May 2018.
[4] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A Mean Field View of the Landscape of Two-Layers Neural Networks. arXiv:1804.06561 [cond-mat, stat], April 2018.
[5] Grant M. Rotskoff and Eric Vanden-Eijnden. Trainability and Accuracy of Neural Networks: An Interacting Particle System Approach. arXiv:1805.00915 [cond-mat, stat], May 2018.
[6] Luca Venturi, Afonso S. Bandeira, and Joan Bruna. Spurious Valleys in Two-layer Neural Network Optimization Landscapes. arXiv:1802.06384 [cs, math, stat], February 2018.
[7] F. Rosenblatt. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6):386–408, 1958.
[8] D. S. Broomhead and David Lowe. Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks. Technical Report RSRE-MEMO-4148, Royal Signals and Radar Establishment Malvern (UK), March 1988.
[9] B. Igelnik and Yoh-Han Pao. Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6(6):1320–1329, November 1995. ISSN 1045-9227. doi: 10.1109/72.471375.
[10] Radford M. Neal. Priors for Infinite Networks. In Bayesian Learning for Neural Networks, Lecture Notes in Statistics, pages 29–53. Springer, New York, NY, 1996. ISBN 978-0-387-94724-2 978-1-4612-0745-0. doi: 10.1007/978-1-4612-0745-0_2.
[11] Christopher K. I. Williams. Computing with Infinite Networks. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 295–301. MIT Press, 1997.
[12] L. P. Wang and C. R. Wan. Comments on "The Extreme Learning Machine". IEEE Transactions on Neural Networks, 19(8):1494–1495, August 2008. ISSN 1045-9227. doi: 10.1109/TNN.2008.2002273.
[13] Simone Scardapane and Dianhui Wang. Randomness in neural networks: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(2):e1200, 2017. ISSN 1942-4795. doi: 10.1002/widm.1200.
[14] Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
[15] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561, September 2008. doi: 10.1109/ALLERTON.2008.4797607.
[16] Surya Ganguli and Haim Sompolinsky. Compressed Sensing, Sparsity, and Dimensionality in Neuronal Information Processing and Data Analysis. Annual Review of Neuroscience, 35(1):485–508, 2012. doi: 10.1146/annurev-neuro-062111-150410.
[17] Sophie J. C. Caron, Vanessa Ruta, L. F. Abbott, and Richard Axel. Random convergence of olfactory inputs in the Drosophila mushroom body. Nature, 497(7447):113–117, May 2013. ISSN 0028-0836. doi: 10.1038/nature12063.
[18] Sophie J. C. Caron. Brains Don't Play Dice—or Do They? Science, 342(6158):574–574, November 2013. ISSN 0036-8075, 1095-9203.
doi: 10.1126/science.1245982.
[19] Kameron Decker Harris, Tatiana Dashevskiy, Joshua Mendoza, Alfredo J. Garcia, Jan-Marino Ramirez, and Eric Shea-Brown. Different roles for inhibition in the rhythm-generating respiratory network. Journal of Neurophysiology, 118(4):2070–2088, October 2017. ISSN 0022-3077, 1522-1598. doi: 10.1152/jn.00174.2017.
[20] Ashok Litwin-Kumar, Kameron Decker Harris, Richard Axel, Haim Sompolinsky, and L. F. Abbott. Optimal Degrees of Synaptic Connectivity. Neuron, 93(5):1153–1164.e7, March 2017. ISSN 0896-6273. doi: 10.1016/j.neuron.2017.01.030.
[21] N. Alex Cayco-Gajic and R. Angus Silver. Re-evaluating Circuit Mechanisms Underlying Pattern Separation. Neuron, 101(4):584–602, February 2019. ISSN 08966273. doi: 10.1016/j.neuron.2019.01.044.
[22] Gabriella H. Wolff and Nicholas J. Strausfeld. Genealogical correspondence of a forebrain centre implies an executive brain in the protostome–deuterostome bilaterian ancestor. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1685):20150055, January 2016. doi: 10.1098/rstb.2015.0055.
[23] Song Han, Jeff Pool, John Tran, and William Dally. Learning both Weights and Connections for Efficient Neural Network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015.
[24] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs], October 2015.
[25] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning Structured Sparsity in Deep Neural Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2074–2082. Curran Associates, Inc., 2016.
[26] Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional Kernel Networks. arXiv:1406.3332 [cs, stat], June 2014.
[27] Corinne Jones, Vincent Roulet, and Zaid Harchaoui. Kernel-based Translations of Convolutional Networks. arXiv:1903.08131 [cs, math, stat], March 2019.
[28] Grace Wahba. Spline Models for Observational Data. SIAM, September 1990. ISBN 978-0-89871-244-5.
[29] Francis R. Bach. Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 105–112. Curran Associates, Inc., 2009.
[30] David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian Processes. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 226–234. Curran Associates, Inc., 2011.
[31] Kirthevasan Kandasamy and Yaoliang Yu. Additive Approximations in High Dimensional Nonparametric Regression via the SALSA. In International Conference on Machine Learning, pages 69–78, June 2016.
[32] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag New York, New York, NY, 2009. ISBN 978-0-387-84858-7. OCLC: 428882834.
[33] Charles J. Stone. Additive Regression and Other Nonparametric Models. The Annals of Statistics, 13(2):689–705, June 1985. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176349548.
[34] Charles J. Stone. The Dimensionality Reduction Principle for Generalized Additive Models. The Annals of Statistics, 14(2):590–606, June 1986. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1176349940.
[35] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors.
arXiv:1207.0580 [cs], July 2012.
[36] Nitish Srivastava. Improving Neural Networks with Dropout. University of Toronto, 2013.
[37] David Duvenaud, Oren Rippel, Ryan P. Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. arXiv:1402.5836 [cs, stat], February 2014.
[38] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified Robustness to Adversarial Examples with Differential Privacy. arXiv:1802.03471 [cs, stat], February 2018.
[39] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified Adversarial Robustness via Randomized Smoothing. arXiv:1902.02918 [cs, stat], February 2019.
[40] Hadi Salman, Greg Yang, Jerry Li, Pengchuan Zhang, Huan Zhang, Ilya Razenshteyn, and Sebastien Bubeck. Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers. arXiv:1906.04584 [cs, stat], June 2019.
[41] Ramón Huerta and Thomas Nowotny. Fast and Robust Learning by Reinforcement Signals: Explorations in the Insect Brain. Neural Computation, 21(8):2123–2151, August 2009. ISSN 0899-7667, 1530-888X. doi: 10.1162/neco.2009.03-08-733.
[42] Charles B. Delahunt and J. Nathan Kutz. Putting a bug in ML: The moth olfactory network learns to read MNIST. Neural Networks, 118:54–64, October 2019. ISSN 0893-6080. doi: 10.1016/j.neunet.2019.05.012.
[43] Charles B. Delahunt and J. Nathan Kutz. Insect cyborgs: Bio-mimetic feature generators improve machine learning accuracy on limited data. arXiv:1808.08124 [cs, stat], August 2018.
[44] Mattia Rigotti, Omri Barak, Melissa R. Warden, Xiao-Jing Wang, Nathaniel D. Daw, Earl K. Miller, and Stefano Fusi. The importance of mixed selectivity in complex cognitive tasks. Nature, 497(7451):585–590, May 2013. ISSN 0028-0836. doi: 10.1038/nature12160.
[45] Baktash Babadi and Haim Sompolinsky. Sparseness and Expansion in Sensory Representations. Neuron, 83(5):1213–1226, September 2014. ISSN 0896-6273.
doi: 10.1016/j.neuron.2014.07.035.
[46] Markus Meister. On the dimensionality of odor space. eLife, 4:e07865, July 2015. ISSN 2050-084X. doi: 10.7554/eLife.07865.
[47] Luca Mazzucato, Alfredo Fontanini, and Giancarlo La Camera. Stimuli Reduce the Dimensionality of Cortical Activity. Frontiers in Systems Neuroscience, 10, 2016. ISSN 1662-5137. doi: 10.3389/fnsys.2016.00011.
[48] Peiran Gao, Eric Trautmann, Byron M. Yu, Gopal Santhanam, Stephen Ryu, Krishna Shenoy, and Surya Ganguli. A theory of multineuronal dimensionality, dynamics and measurement. November 2017. doi: 10.1101/214262.
[49] Francesca Mastrogiuseppe and Srdjan Ostojic. Linking Connectivity, Dynamics, and Computations in Low-Rank Recurrent Neural Networks. Neuron, 99(3):609–623.e29, August 2018. ISSN 0896-6273. doi: 10.1016/j.neuron.2018.07.003.
[50] Matthew S. Farrell, Stefano Recanatesi, Guillaume Lajoie, and Eric Shea-Brown. Dynamic compression and expansion in a classifying recurrent network. bioRxiv, page 564476, March 2019. doi: 10.1101/564476.
[51] Tong Zhang. Learning Bounds for Kernel Regression Using Effective Data Dimensionality. Neural Computation, 17(9):2077–2098, September 2005. ISSN 0899-7667. doi: 10.1162/0899766054323008.

Appendices: Additive function approximation in the brain

Table of contents
• Appendix A: Test problems and numerical experiments
• Appendix B: Kernel examples arising from random features, dense and sparse
• Appendix C: Kernel approximation results, uniform convergence of Lipschitz features

A Test problems

We have implemented sparse random features in Python to demonstrate the properties of learning in this basis. Our code, which provides scikit-learn style SparseRFRegressor and SparseRFClassifier estimators, is available from https://github.com/kharris/sparse-random-features.
A.1 Additive function approximation

A.1.1 Comparison with datasets from Kandasamy and Yu [1]

As said in the main text, Kandasamy and Yu [1] created a theory of the generalization properties of higher-order additive models. They supplemented this with an empirical study of a number of datasets using their Shrunk Additive Least Squares Approximation (SALSA) implementation of additive kernel ridge regression (KRR). Their data and code were obtained from https://github.com/kirthevasank/salsa. We compared the performance of SALSA to the sparse random feature approximation of the same kernel. We employ sparse random Fourier features with Gaussian weights N(0, σ²I), with σ = 0.05 · √d n^{1/5} in order to match the Gaussian radial basis function used by Kandasamy and Yu [1]. We use m = 300 l features for every problem, with regular degree d selected equal to the one chosen by SALSA. The regressor on the features is cross-validated ridge regression (RidgeCV from scikit-learn) with ridge penalty selected from 5 logarithmically spaced points between 10⁻⁴ · n and 10² · n. In Figure 2, we compare the performance of sparse random features to SALSA. Generally, the training and testing errors of the sparse model are slightly higher than for the kernel method, except for the forestfires dataset.

Figure 2: Comparison of the sparse random feature approximation to the additive kernel method SALSA [1]. The parameters were matched between the two models (see text). The sparse feature approximation performs slightly worse than the exact method, but similarly.

Figure 3: Performance of sparse random features of differing degree d and training size n for the polynomial test function. Test error is measured as mean square error with noise floor at 0.0025. As the amount of training data increases, higher d, i.e. more model complexity, is preferred.
A.1.2 Polynomial test function shows generalization from fewer examples

We studied the speed of learning for a test function as well. The function to be learned, f(x), was a sparse polynomial plus a linear term: f(x) = c₁ a^⊤x + c₂ p(x). The linear term took a ∼ N(0, I), and the polynomial p was chosen to have 3 terms of degree 3 with weights drawn from N(0, 1). The inputs x are drawn from the uniform distribution over [0, 1]^16. Gaussian noise with variance 0.05² was added to generate observations y_i = f(x_i) + ε_i. The constants c₁ and c₂ were tuned by setting

c₁ = (1/σ_lin) · (1 − α)/√(α² + (1 − α)²)  and  c₂ = (1/σ_nonlin) · α/√(α² + (1 − α)²),

where α = 0.05 and σ_lin and σ_nonlin were the standard deviations of the linear and nonlinear terms alone. For this problem we use random features of varying regular degrees d = 1, 3, 10, 16 and number of data points n. The features use a Fourier nonlinearity h(·) = (sin ·, cos ·), weights w_ij ∼ N(0, d^{−1/2}), and biases b_i ∼ U([−π, π]), leading to an RBF kernel in d dimensions. The output regression model is again ridge regression with the penalty selected via cross-validation on the training set from 7 logarithmically spaced points between 10⁻⁴ and 10². In Figure 3, we show the test error as well as the selected ridge penalty for different values of d and n. With a small amount of data (n < 250), the model with d = 1 has the lowest test error, since this "simplest" model is less likely to overfit. On the other hand, in the intermediate data regime (250 < n < 400), the model with d = 3 does best. For large amounts of data (n > 400), all of the models with interactions d ≥ 3 do roughly the same. Note that with the RBF kernel the RKHS satisfies H_d ⊆ H_{d′} whenever d ≤ d′, so d > 3 can still capture the degree 3 polynomial model. However, we see that the more complex models have a higher ridge penalty selected.
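The normalization of c₁ and c₂ can be sanity-checked numerically. In this sketch (our reconstruction of the setup; variable names and random seeds are ours), the ratio of the standard deviations of the scaled linear and nonlinear parts comes out to exactly (1 − α)/α by construction:

```python
import numpy as np

rng = np.random.default_rng(6)
l, n, alpha = 16, 2000, 0.05

X = rng.uniform(size=(n, l))          # inputs uniform on [0, 1]^16
a = rng.normal(size=l)
lin = X @ a                           # linear part a^T x

# Sparse polynomial part: 3 random degree-3 monomials with N(0, 1) weights.
terms = [rng.choice(l, size=3, replace=False) for _ in range(3)]
coef = rng.normal(size=3)
nonlin = sum(c * X[:, t].prod(axis=1) for c, t in zip(coef, terms))

z = np.sqrt(alpha ** 2 + (1 - alpha) ** 2)
c1 = (1 - alpha) / (lin.std() * z)    # c1 = (1/sigma_lin) (1-alpha)/sqrt(...)
c2 = alpha / (nonlin.std() * z)       # c2 = (1/sigma_nonlin) alpha/sqrt(...)
f = c1 * lin + c2 * nonlin            # the test function, before observation noise

ratio = (c1 * lin).std() / (c2 * nonlin).std()   # = (1 - alpha)/alpha exactly
```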
The penalty is able to adaptively control this complexity given enough data.

A.2 Stability with respect to sparse input noise

Here we show that sparse random features are stable to spike-and-slab input noise. In this example, the truth follows a linear model, where we have random input points x_i ∼ N(0, I) and linear observations y_i = x_i^⊤β for i = 1, ..., n and β ∼ N(0, I). However, we only have access to sparsely corrupted inputs w_i = x_i + e_i, where e_i = 0 with probability 1 − p and e_i = x̃ − x_i with probability p, for x̃ ∼ N(0, σ²I). That is, the corrupted inputs are replaced with pure noise. We use p = 0.03 ≪ 1 and σ = 6 ≫ 1 so that the noise is sparse but large when it occurs. In Table 1 we show the performance of various methods on this regression problem given the corrupted data (W, y). Note that if the practitioner has access to the uncorrupted data X, linear regression succeeds with a perfect score of 1. Using kernel ridge regression with k(x, x′) = 1 − (1/l)‖x − x′‖₁, the kernel that arises from sparse random features with d = 1 and sign nonlinearity, leads to improved performance over naïve linear regression on the corrupted data or a robust Huber loss function.

Model          Training score   Testing score
Linear         0.854            0.453
Kernel         1.000            0.607
Trim + linear  0.945            0.686
Huber          0.858            0.392

Table 1: Scores (R² coefficient) of various regression models on linear data with corrupted inputs. In the presence of these errors, linear regression fails to achieve as good a test score as the kernel method, which is almost as good as trimming before performing regression and better than the robust Huber estimator.

Figure 4: Kernel eigenvalue amplification while (left) varying p with σ = 6 fixed, and (right) varying σ with p = 0.03 fixed. Plotted is the ratio of the eigenvalues of the kernel matrix corrupted by noise to those without any corruption, ordered from largest to smallest in magnitude.
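The stability mechanism can be illustrated directly on the kernel matrix. In this sketch (our own code, mirroring the text's p = 0.03 and σ = 6, and taking c = 1 in equation (6)), spike-and-slab corruption perturbs the ℓ₁ kernel far less than a squared-distance comparison kernel, because each outlier enters linearly rather than quadratically:

```python
import numpy as np

rng = np.random.default_rng(4)
l, n, p, sigma = 20, 300, 0.03, 6.0

X = rng.normal(size=(n, l))
# Spike-and-slab corruption: each entry is replaced by pure noise w.p. p.
mask = rng.random((n, l)) < p
X_noisy = np.where(mask, rng.normal(scale=sigma, size=(n, l)), X)

def l1_kernel(A):
    # k(x, x') = 1 - ||x - x'||_1 / l   (equation (6) with c = 1)
    return 1.0 - np.abs(A[:, None, :] - A[None, :, :]).sum(-1) / A.shape[1]

def sq_kernel(A):
    # comparison kernel: 1 - ||x - x'||_2^2 / l, quadratic in each outlier
    return 1.0 - ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1) / A.shape[1]

def rel_change(kernel):
    """Relative Frobenius change of the Gram matrix under corruption."""
    K, K_noisy = kernel(X), kernel(X_noisy)
    return np.linalg.norm(K_noisy - K) / np.linalg.norm(K)

amp_l1, amp_sq = rel_change(l1_kernel), rel_change(sq_kernel)
```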
We see that the sparse feature kernel shows little noise amplification when the noise is sparse (right), even at large amplitudes. On the other hand, less sparse noise does get amplified (left). The best performance is attained by trimming the outliers and then performing linear regression. However, this example is meant to illustrate our point that sparse random features and their corresponding kernels may be useful when dealing with noisy inputs in a learning problem.

In Figure 4 we show another way of measuring this stability property. We compute the eigenvalues of the kernel matrix on a fixed dataset of $n = 800$ points, both with and without noise. Plotted is the ratio of the noisy to noiseless eigenvalues, in decibels, which we call the amplification; it measures how much the kernel matrix is corrupted by this noise. The main trend we see is that, for fixed $p = 0.03$, changing the amplitude of the noise $\sigma$ does not lead to significant amplification, especially of the early eigenvalues, which are of largest magnitude. On the other hand, making the outliers denser does lead to more amplification of all the eigenvalues. The eigenspace spanned by the largest eigenvalues is the most "important" for any learning problem.

B Kernel examples

B.1 Fully-connected weights

We will now describe a number of common random features and the kernels they generate with fully-connected weights. Later on, we will see how these change as sparsity is introduced in the input-hidden connections.

Translation-invariant kernels. The classical random features [2] sample Gaussian weights $\mathbf{w} \sim \mathcal{N}(0, \sigma^{-2} I)$, uniform biases $b \sim U([-a, a])$, and employ the Fourier nonlinearity $h(\cdot) = \cos(\cdot)$. This leads to the Gaussian radial basis function kernel
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}'\|^2\right)$$
for $\mathbf{x}, \mathbf{x}' \in [-a, a]^l$.
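This classical correspondence is easy to verify by Monte Carlo. A short sketch, using the common equivalent parameterization $b \sim U([0, 2\pi])$ with $\phi(\mathbf{x}) = \sqrt{2}\cos(\mathbf{w}^\top\mathbf{x} + b)$ rather than the text's $\cos(\mathbf{w}^\top\mathbf{x} - b)$ on $[-a, a]$:

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, sigma = 5, 200_000, 1.5

x = rng.standard_normal(l)
xp = rng.standard_normal(l)

# Random Fourier features: w ~ N(0, sigma^{-2} I), b ~ U([0, 2*pi]),
# phi(x) = sqrt(2) cos(w'x + b).  The average of phi(x) phi(x') over
# (w, b) converges to the Gaussian RBF kernel.
Wf = rng.normal(0, 1 / sigma, (m, l))
b = rng.uniform(0, 2 * np.pi, m)
phi = lambda z: np.sqrt(2) * np.cos(Wf @ z + b)

k_mc = phi(x) @ phi(xp) / m
k_true = np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))
print(k_mc, k_true)  # should agree to about 1e-2
```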
In fact, every translation-invariant kernel arises from Fourier nonlinearities for some distribution of weights and biases (Bochner's theorem).

Moment generating function kernels. The exponential function is more similar to the kinds of monotone firing-rate curves found in biological neurons. In this case, we have
$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\exp(\mathbf{w}^\top(\mathbf{x} + \mathbf{x}') + 2b).$$
We can often evaluate this expectation using moment generating functions. For example, if $\mathbf{w}$ and $b$ are independent, which is a common assumption, then
$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\exp(\mathbf{w}^\top(\mathbf{x} + \mathbf{x}')) \cdot \mathbb{E}\exp(2b),$$
where $\mathbb{E}\exp(\mathbf{w}^\top(\mathbf{x} + \mathbf{x}'))$ is the moment generating function of the marginal distribution of $\mathbf{w}$, and $\mathbb{E}\exp(2b)$ is just a constant that scales the kernel. For multivariate Gaussian weights $\mathbf{w} \sim \mathcal{N}(\mathbf{m}, \Sigma)$ this becomes
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(\mathbf{m}^\top(\mathbf{x} + \mathbf{x}') + \tfrac{1}{2}(\mathbf{x} + \mathbf{x}')^\top\Sigma(\mathbf{x} + \mathbf{x}')\right) \cdot \mathbb{E}\exp(2b).$$
This equation becomes more interpretable if $\mathbf{m} = 0$, $\Sigma = \sigma^{-2} I$, and the input data are normalized, $\|\mathbf{x}\| = \|\mathbf{x}'\| = 1$. Then
$$k(\mathbf{x}, \mathbf{x}') \propto \exp(\sigma^{-2}\mathbf{x}^\top\mathbf{x}') \propto \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{x}'\|^2\right).$$
This result highlights that dot product kernels $k(\mathbf{x}, \mathbf{x}') = v(\mathbf{x}^\top\mathbf{x}')$, where $v : \mathbb{R} \to \mathbb{R}$, are radial basis functions on the sphere $\mathbb{S}^{l-1} = \{\mathbf{x} \in \mathbb{R}^l : \|\mathbf{x}\|_2 = 1\}$. The eigenbasis of these kernels is the spherical harmonics [3, 4].

Arc-cosine kernels. This class of kernels is also induced by monotone "neuronal" nonlinearities and leads to different radial basis functions on the sphere [3, 5, 6]. Consider standard normal weights $\mathbf{w} \sim \mathcal{N}(0, I)$ and nonlinearities which are thresholded polynomial functions $h(z) = \Theta(z)z^p$ for integer $p \geq 0$, where $\Theta(\cdot)$ is the Heaviside step function. The kernel in this case is given by
$$k(\mathbf{x}, \mathbf{x}') = 2\int_{\mathbb{R}^l} \Theta(\mathbf{w}^\top\mathbf{x})\Theta(\mathbf{w}^\top\mathbf{x}')(\mathbf{w}^\top\mathbf{x})^p(\mathbf{w}^\top\mathbf{x}')^p \frac{e^{-\|\mathbf{w}\|^2/2}}{(2\pi)^{l/2}}\,d\mathbf{w} = \frac{1}{\pi}\|\mathbf{x}\|^p\|\mathbf{x}'\|^p J_p(\theta),$$
for a known function $J_p(\theta)$, where $\theta = \arccos\left(\frac{\mathbf{x}^\top\mathbf{x}'}{\|\mathbf{x}\|\|\mathbf{x}'\|}\right)$.
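The $p = 1$ case serves as a quick Monte Carlo check of this formula. Assuming $J_1(\theta) = \sin\theta + (\pi - \theta)\cos\theta$, the form given by Cho and Saul [5], a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
l, m = 4, 200_000

x = rng.standard_normal(l); x /= np.linalg.norm(x)
xp = rng.standard_normal(l); xp /= np.linalg.norm(xp)

# ReLU features with standard normal weights give the p = 1 arc-cosine
# kernel; the sqrt(2) accounts for the factor of 2 in the kernel's
# integral definition.
W = rng.standard_normal((m, l))
phi = lambda z: np.sqrt(2) * np.maximum(W @ z, 0.0)

k_mc = phi(x) @ phi(xp) / m
theta = np.arccos(np.clip(x @ xp, -1.0, 1.0))
k_true = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
print(k_mc, k_true)
```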
Note that arc-cosine kernels are also dot product kernels. Also, if the weights are drawn as $\mathbf{w} \sim \mathcal{N}(0, \sigma^{-2} I)$, the terms $\mathbf{x}$ are replaced by $\mathbf{x}/\sigma$, but this does not affect $\theta$. With $p = 0$, corresponding to the step function nonlinearity, we have $J_0(\theta) = \pi - \theta$, and the resulting kernel does not depend on $\|\mathbf{x}\|$ or $\|\mathbf{x}'\|$:
$$k(\mathbf{x}, \mathbf{x}') = 1 - \frac{1}{\pi}\arccos\left(\frac{\mathbf{x}^\top\mathbf{x}'}{\|\mathbf{x}\|\|\mathbf{x}'\|}\right). \quad (7)$$

Sign nonlinearity. We also consider a shifted version of the step function nonlinearity, the sign function $\mathrm{sgn}(z)$, equal to $+1$ when $z > 0$, $-1$ when $z < 0$, and zero when $z = 0$. Let $b \sim U([a_1, a_2])$ and $\mathbf{w} \sim P$, where $P$ is any spherically symmetric distribution, such as a Gaussian. Then
$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\int_{a_1}^{a_2}\frac{db}{a_2 - a_1}\,\mathrm{sgn}(\mathbf{w}^\top\mathbf{x} - b)\,\mathrm{sgn}(\mathbf{w}^\top\mathbf{x}' - b) = \mathbb{E}\left[1 - \frac{2|\mathbf{w}^\top\mathbf{x} - \mathbf{w}^\top\mathbf{x}'|}{a_2 - a_1}\right] = 1 - \frac{2}{a_2 - a_1}\mathbb{E}|\mathbf{w}^\top(\mathbf{x} - \mathbf{x}')| = 1 - \frac{2\,\mathbb{E}(|\mathbf{w}^\top\mathbf{e}|)\,\|\mathbf{x} - \mathbf{x}'\|_2}{a_2 - a_1},$$
where $\mathbf{e} = (\mathbf{x} - \mathbf{x}')/\|\mathbf{x} - \mathbf{x}'\|_2$. The factor $\mathbb{E}(|\mathbf{w}^\top\mathbf{e}|)$ in front of the norm is just a function of the radial part of the distribution $P$, which we should set inversely proportional to $\sqrt{l}$ to match the scaling of $\|\mathbf{x} - \mathbf{x}'\|_2$. For $\mathbf{w} \sim \mathcal{N}(0, \sigma^2 l^{-1} I)$, we obtain
$$k(\mathbf{x}, \mathbf{x}') = 1 - \frac{2\sigma\sqrt{2/(\pi l)}\,\|\mathbf{x} - \mathbf{x}'\|_2}{a_2 - a_1}. \quad (8)$$

B.2 Sparse weights

The sparsest networks possible have $d = 1$, leading to first-order additive kernels. Here we look at two simple nonlinearities where we can perform the sum and obtain an explicit formula for the additive kernel. In both cases, the kernels are simply related to a robust distance metric. This suggests that such kernels may be useful in cases where there are outlier coordinates in the input data.

Step function nonlinearity. We again consider the step function nonlinearity $h(\cdot) = \Theta(\cdot)$, which in the case of fully-connected Gaussian weights leads to the degree $p = 0$ arc-cosine kernel $k(\mathbf{x}, \mathbf{x}') = 1 - \theta(\mathbf{x}, \mathbf{x}')/\pi$. When $d = 1$, $\mathbf{x}_\mathcal{N} = x_i$ and $\mathbf{x}'_\mathcal{N} = x'_i$ are scalars.
For a scalar $a$, normalization leads to $a/\|a\| = \mathrm{sgn}(a)$. Therefore, $\theta = \arccos(\mathrm{sgn}(x_i)\,\mathrm{sgn}(x'_i))$, which equals $0$ if $\mathrm{sgn}(x_i) = \mathrm{sgn}(x'_i)$ and $\pi$ otherwise. Performing the sum in (3), we find that the kernel becomes
$$k_1^{\mathrm{reg}}(\mathbf{x}, \mathbf{x}') = 1 - \frac{|\{i : \mathrm{sgn}(x_i) \neq \mathrm{sgn}(x'_i)\}|}{l} = 1 - \frac{\|\mathrm{sgn}(\mathbf{x}) - \mathrm{sgn}(\mathbf{x}')\|_0}{l}. \quad (9)$$
This kernel is equal to one minus the normalized Hamming distance of the vectors $\mathrm{sgn}(\mathbf{x})$ and $\mathrm{sgn}(\mathbf{x}')$. The fully-connected kernel, on the other hand, uses the full angle between the vectors $\mathbf{x}$ and $\mathbf{x}'$. The sparsity can be seen as inducing a "quantization," via the sign function, on these vectors. Finally, if the data lie in the binary hypercube, $\mathbf{x}, \mathbf{x}' \in \{-1, +1\}^l$, then the kernel is exactly one minus the normalized Hamming distance.

Sign nonlinearity. We now consider a slightly different nonlinearity, the sign function $h(\cdot) = \mathrm{sgn}(\cdot) = 2\Theta(\cdot) - 1$. It will turn out that the kernel is quite different than for the step function. Let $b \sim U([a_1, a_2])$ and $w \sim P$. Then
$$k_1^{\mathrm{reg}}(\mathbf{x}, \mathbf{x}') = \frac{1}{l}\sum_{i=1}^l \mathbb{E}_P\int_{a_1}^{a_2}\frac{db}{a_2 - a_1}\,\mathrm{sgn}(w x_i - b)\,\mathrm{sgn}(w x'_i - b) = \frac{1}{l}\sum_{i=1}^l \mathbb{E}_P\left[1 - \frac{2|w x_i - w x'_i|}{a_2 - a_1}\right] = 1 - \frac{2\,\mathbb{E}_P(|w|)}{l}\,\frac{\|\mathbf{x} - \mathbf{x}'\|_1}{a_2 - a_1}. \quad (10)$$
Choosing $P(w) = \frac{1}{2}\delta(w + 1) + \frac{1}{2}\delta(w - 1)$ and $a_2 = -a_1 = a$ recovers the "random stump" result of Rahimi and Recht [2]. Despite the fact that sign is just a shifted version of the step function, the kernels are quite different: the sign nonlinearity does not exhibit the quantization effect, and it depends on the $\ell_1$-norm rather than the $\ell_0$-"norm".

C Kernel approximation results

We now show a basic uniform convergence result for any random features, not necessarily sparse, that use Lipschitz continuous nonlinearities. Recall the definition of a Lipschitz function:
Definition 1. A function $f : \mathcal{X} \to \mathbb{R}$ is said to be $L$-Lipschitz continuous (or Lipschitz with constant $L$) if $|f(\mathbf{x}) - f(\mathbf{y})| \leq L\|\mathbf{x} - \mathbf{y}\|$ holds for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$. Here, $\|\cdot\|$ is a norm on $\mathcal{X}$ (the $\ell_2$-norm unless otherwise specified).

Assuming that $h$ is Lipschitz, along with some regularity assumptions on the distribution $\mu$, the random feature expansion approximates the kernel uniformly over $\mathcal{X}$. As far as we are aware, this result has not been stated previously, although it appears to be known (see Bach [7]) and is very similar to Claim 1 in Rahimi and Recht [2], which holds only for random Fourier features (see also Sutherland and Schneider [8] and Sriperumbudur and Szabo [9] for improved results in this case). The rates we obtain for Lipschitz nonlinearities are not essentially different from those obtained in the Fourier features case. Among the examples we have given, the only nonlinearities which are not Lipschitz are the step function (order 0 arc-cosine kernel) and the sign function. Since these functions are discontinuous, their convergence to the kernel occurs in a weaker-than-uniform sense. However, our result does apply to the rectified linear nonlinearity (order 1 arc-cosine kernel), which is non-differentiable at zero but 1-Lipschitz and widely applied in artificial neural networks. The proof of the following theorem appears at the end of this section.

Theorem 1 (Kernel approximation for Lipschitz nonlinearities). Assume that $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^l$, that $\mathcal{X}$ is compact with $\Delta = \mathrm{diam}(\mathcal{X})$, and that the null vector $\mathbf{0} \in \mathcal{X}$. Let the weights and biases $(\mathbf{w}, b)$ follow the distribution $\mu$ on $\mathbb{R}^{l+1}$ with finite second moments. Let $h(\cdot)$ be a nonlinearity which is $L$-Lipschitz continuous, and define the random feature $\phi : \mathbb{R}^l \to \mathbb{R}$ by $\phi(\mathbf{x}) = h(\mathbf{w}^\top\mathbf{x} - b)$. We assume that the following hold for all $\mathbf{x} \in \mathcal{X}$: (1) $|\phi(\mathbf{x})| \leq \kappa$ almost surely, (2) $\mathbb{E}|\phi(\mathbf{x})|^2 < \infty$, and (3) $\mathbb{E}\,\phi(\mathbf{x})\phi(\mathbf{x}') = k(\mathbf{x}, \mathbf{x}')$.
Then
$$\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}}\left|\frac{1}{m}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}') - k(\mathbf{x}, \mathbf{x}')\right| \leq \epsilon$$
with probability at least
$$1 - 2\left(\frac{8\kappa L\Delta\sqrt{\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2}}{\epsilon}\right)^2 \exp\left(-\frac{m\epsilon^2}{4(2l + 2)\kappa^2}\right).$$

Sample complexity. Theorem 1 guarantees uniform approximation up to error $\epsilon$ using $m = \Omega\left(\frac{l\kappa^2}{\epsilon^2}\log\frac{C}{\epsilon}\right)$ features, for a constant $C$ depending on the other problem parameters. This is precisely the same dependence on $l$ and $\epsilon$ as for random Fourier features. A limitation of Theorem 1 is that it only shows approximation of the limiting kernel rather than direct approximation of functions in the RKHS. A more detailed analysis of the convergence to the RKHS is contained in the work of Bach [7], whereas Rudi and Rosasco [10] directly analyze the generalization ability of these approximations. Sun et al. [11] show even faster rates, which also apply to SVMs, assuming the features are compatible ("optimized") with the learning problem. Also, the techniques of Sutherland and Schneider [8] and Sriperumbudur and Szabo [9] could be used to improve our constants and prove convergence in other $L^p$ norms.

In the sparse case, we must extend our probability space to capture the randomness of (1) the degrees, (2) the neighborhoods conditional on the degree, and (3) the weight vectors conditional on the degree and neighborhood. The degrees are distributed independently according to $d_i \sim D$, with some abuse of notation since we also use $D(d)$ to represent the probability mass function. We shall always think of the neighborhood $\mathcal{N} \sim \nu|d$ as chosen uniformly among all $d$-element subsets, where $\nu|d$ represents this conditional distribution. Finally, given a neighborhood of some degree, the nonzero weights and bias are drawn from a distribution $(\mathbf{w}, b) \sim \mu|d$ on $\mathbb{R}^{d+1}$. For simpler notation, we do not show any dependence on the neighborhood here, since we will always take the actual weight values not to depend on the particular neighborhood $\mathcal{N}$. However, strictly speaking, the weights do depend on $\mathcal{N}$ because it determines their support.
Finally, we use $\mathbb{E}$ to denote expectation over all variables (degree, neighborhood, and weights), whereas we use $\mathbb{E}_{\mu|d}$ for the expectation under $\mu|d$ for a given degree.

Corollary 2 (Kernel approximation with sparse features). Assume that $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^l$, that $\mathcal{X}$ is compact with $\Delta = \mathrm{diam}(\mathcal{X})$, and that the null vector $\mathbf{0} \in \mathcal{X}$. Let the degrees $d$ follow the degree distribution $D$ on $[l]$. For every $d \in [l]$, let $\mu|d$ denote the conditional distribution of $(\mathbf{w}, b)$ on $\mathbb{R}^{d+1}$, and assume these have finite second moments. Let $h(\cdot)$ be a nonlinearity which is $L$-Lipschitz continuous, and define the random feature $\phi : \mathbb{R}^l \to \mathbb{R}$ by $\phi(\mathbf{x}) = h(\mathbf{w}^\top\mathbf{x} - b)$, where $\mathbf{w}$ follows the degree distribution model. We assume that the following hold for all $\mathbf{x}_\mathcal{N} \in \mathcal{X}_\mathcal{N}$ with $|\mathcal{N}| = d$ and for all $1 \leq d \leq l$: (1) $|\phi(\mathbf{x}_\mathcal{N})| \leq \kappa$ almost surely under $\mu|d$, (2) $\mathbb{E}[|\phi(\mathbf{x}_\mathcal{N})|^2 \mid d] < \infty$, and (3) $\mathbb{E}[\phi(\mathbf{x}_\mathcal{N})\phi(\mathbf{x}'_\mathcal{N}) \mid d] = k_d^{\mathrm{reg}}(\mathbf{x}_\mathcal{N}, \mathbf{x}'_\mathcal{N})$. Then
$$\sup_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}}\left|\frac{1}{m}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}') - k_D^{\mathrm{dist}}(\mathbf{x}, \mathbf{x}')\right| \leq \epsilon$$
with probability at least
$$1 - 2\left(\frac{8\kappa L\Delta\sqrt{\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2}}{\epsilon}\right)^2 \exp\left(-\frac{m\epsilon^2}{4(2l + 2)\kappa^2}\right).$$
The kernels $k_d^{\mathrm{reg}}(\mathbf{z}, \mathbf{z}')$ and $k_D^{\mathrm{dist}}(\mathbf{x}, \mathbf{x}')$ are given by equations (3) and (4).

Proof. It suffices to show that conditions (1–3) on the conditional distributions $\mu|d$, $d \in [l]$, imply conditions (1–3) of Theorem 1. Conditions (1) and (2) clearly hold, since the distribution $D$ has finite support. By construction,
$$\mathbb{E}\,\phi(\mathbf{x})\phi(\mathbf{x}') = \mathbb{E}\left[\mathbb{E}[\phi(\mathbf{x}_\mathcal{N})\phi(\mathbf{x}'_\mathcal{N}) \mid d]\right] = \mathbb{E}\left[k_d^{\mathrm{reg}}(\mathbf{x}_\mathcal{N}, \mathbf{x}'_\mathcal{N})\right] = k_D^{\mathrm{dist}}(\mathbf{x}, \mathbf{x}'),$$
which concludes the proof.

Differences of sparsity. The only difference we find with sparse random features is in the terms $\mathbb{E}\|\mathbf{w}\|^2$ and $\mathbb{E}\|\mathbf{w}\|$, since sparsity adds variance to the weights. This suggests that scaling the weights so that $\mathbb{E}_{\mu|d}\|\mathbf{w}\|^2$ is constant for all $d$ is a good idea.
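This scaling is easy to check numerically; a short sketch, with arbitrary illustrative choices of $\sigma$ and the degrees:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m = 2.0, 100_000

# With w ~ N(0, sigma^2 d^{-1} I_d), the second moment E||w||^2 equals
# sigma^2 for every degree d, so the constants in the bound do not grow.
means = {}
for d in (1, 4, 16):
    W = rng.normal(0, sigma / np.sqrt(d), (m, d))
    means[d] = np.mean(np.sum(W**2, axis=1))
print(means)  # each entry should be close to sigma**2 = 4
```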
For example, setting $(w_i)_\mathcal{N} \sim \mathcal{N}(0, \sigma^2 d_i^{-1} I_{d_i})$, the random variables $\|\mathbf{w}_i\|^2 \sim \sigma^2 d_i^{-1}\chi^2(d_i)$ and $\|\mathbf{w}_i\| \sim \sigma d_i^{-1/2}\chi(d_i)$. Then $\mathbb{E}\|\mathbf{w}_i\|^2 = \sigma^2$ regardless of $d_i$, and $\mathbb{E}\|\mathbf{w}_i\| = \sigma(1 + o(1))$ as $d_i$ grows. With this choice, the number of sparse features needed to achieve an error $\epsilon$ is the same as in the dense case, up to a small constant factor. This is perhaps remarkable, since there could be as many as $\sum_{d=0}^{l}\binom{l}{d} = 2^l$ terms in the expression of $k_D^{\mathrm{dist}}(\mathbf{x}, \mathbf{x}')$. However, the random feature expansion does not need to approximate all of these terms well, just their average.

Proof of Theorem 1. We follow the approach of Claim 1 in [2], a similar result for random Fourier features which crucially uses the fact that the trigonometric functions are differentiable and bounded. For simplicity of notation, let $\boldsymbol{\xi} = (\mathbf{x}, \mathbf{x}')$ and define the direct sum norm on $\mathcal{X}^+ = \mathcal{X} \oplus \mathcal{X}$ as $\|\boldsymbol{\xi}\|_+ = \|\mathbf{x}\| + \|\mathbf{x}'\|$. Under this norm $\mathcal{X}^+$ is a Banach space but not a Hilbert space; however, this will not matter. For $i = 1, \ldots, m$, let
$$f_i(\boldsymbol{\xi}) = \phi_i(\mathbf{x})\phi_i(\mathbf{x}'), \qquad g_i(\boldsymbol{\xi}) = \phi_i(\mathbf{x})\phi_i(\mathbf{x}') - k(\mathbf{x}, \mathbf{x}') = f_i(\boldsymbol{\xi}) - \mathbb{E}f_i(\boldsymbol{\xi}),$$
and note that these $g_i$ are i.i.d., centered random variables. By assumptions (1) and (2), $f_i$ and $g_i$ are absolutely integrable and $k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\,\phi_i(\mathbf{x})\phi_i(\mathbf{x}')$. Denote their mean by
$$\bar{g}(\boldsymbol{\xi}) = \frac{1}{m}\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\phi}(\mathbf{x}') - k(\mathbf{x}, \mathbf{x}') = \frac{1}{m}\sum_{i=1}^m g_i(\boldsymbol{\xi}).$$
Our goal is to show that $|\bar{g}(\boldsymbol{\xi})| \leq \epsilon$ for all $\boldsymbol{\xi} \in \mathcal{X}^+$ with sufficiently high probability. The space $\mathcal{X}^+$ is compact and $2l$-dimensional, and it has diameter at most twice the diameter of $\mathcal{X}$ under the sum norm. Thus we can cover $\mathcal{X}^+$ with an $\epsilon$-net using at most $T = (4\Delta/R)^{2l}$ balls of radius $R$. Call the centers of these balls $\boldsymbol{\xi}_i$ for $i = 1, \ldots, T$, and let $\bar{L}$ denote the Lipschitz constant of $\bar{g}$ with respect to the sum norm. Then $|\bar{g}(\boldsymbol{\xi})| \leq \epsilon$ for all $\boldsymbol{\xi} \in \mathcal{X}^+$ if we show that:
1. $\bar{L} \leq \frac{\epsilon}{2R}$, and
2. $|\bar{g}(\boldsymbol{\xi}_i)| \leq \frac{\epsilon}{2}$ for all $i$.

First, we bound the Lipschitz constant of $g_i$ with respect to the sum norm $\|\cdot\|_+$. Since $h$ is $L$-Lipschitz, $\phi_i$ is Lipschitz with constant $L\|\mathbf{w}_i\|$. Thus, letting $\boldsymbol{\xi}' = \boldsymbol{\xi} + (\boldsymbol{\delta}, \boldsymbol{\delta}')$ and bounding $|f_i(\boldsymbol{\xi}) - f_i(\boldsymbol{\xi}')|$ along two paths,
$$2|f_i(\boldsymbol{\xi}) - f_i(\boldsymbol{\xi}')| \leq |\phi_i(\mathbf{x}+\boldsymbol{\delta})\phi_i(\mathbf{x}'+\boldsymbol{\delta}') - \phi_i(\mathbf{x}+\boldsymbol{\delta})\phi_i(\mathbf{x}')| + |\phi_i(\mathbf{x}+\boldsymbol{\delta})\phi_i(\mathbf{x}'+\boldsymbol{\delta}') - \phi_i(\mathbf{x})\phi_i(\mathbf{x}'+\boldsymbol{\delta}')| + |\phi_i(\mathbf{x}+\boldsymbol{\delta})\phi_i(\mathbf{x}') - \phi_i(\mathbf{x})\phi_i(\mathbf{x}')| + |\phi_i(\mathbf{x})\phi_i(\mathbf{x}'+\boldsymbol{\delta}') - \phi_i(\mathbf{x})\phi_i(\mathbf{x}')| \leq 2L\|\mathbf{w}_i\| \cdot \sup_{\mathbf{x}\in\mathcal{X}}|\phi_i(\mathbf{x})| \cdot (\|\boldsymbol{\delta}\| + \|\boldsymbol{\delta}'\|) = 2\kappa L\|\mathbf{w}_i\| \cdot \|\boldsymbol{\xi} - \boldsymbol{\xi}'\|_+,$$
so $f_i$ has Lipschitz constant $\kappa L\|\mathbf{w}_i\|$. This implies that $g_i$ has Lipschitz constant at most $\kappa L(\|\mathbf{w}_i\| + \mathbb{E}\|\mathbf{w}\|)$. Let $\bar{L}$ denote the Lipschitz constant of $\bar{g}$. Note that $\mathbb{E}\bar{L} \leq 2\kappa L\,\mathbb{E}\|\mathbf{w}\|$. Also,
$$\mathbb{E}\bar{L}^2 \leq L^2\kappa^2\,\mathbb{E}(\|\mathbf{w}\| + \mathbb{E}\|\mathbf{w}\|)^2 = L^2\kappa^2\left(\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2\right).$$
Markov's inequality states that $\Pr[\bar{L}^2 > t^2] \leq \mathbb{E}[\bar{L}^2]/t^2$. Letting $t = \frac{\epsilon}{2R}$, we find that
$$\Pr\left[\bar{L} > \frac{\epsilon}{2R}\right] \leq L^2\kappa^2\left(\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2\right)\left(\frac{2R}{\epsilon}\right)^2. \quad (11)$$
Next we show that $|\bar{g}(\boldsymbol{\xi}_i)| \leq \epsilon/2$ for all $i = 1, \ldots, T$ anchors of the $\epsilon$-net. A straightforward application of Hoeffding's inequality and a union bound shows that
$$\Pr\left[|\bar{g}(\boldsymbol{\xi}_i)| > \frac{\epsilon}{2} \text{ for some } i\right] \leq 2T\exp\left(-\frac{m\epsilon^2}{8\kappa^4}\right), \quad (12)$$
since $|f_i(\boldsymbol{\xi})| \leq \kappa^2$. Combining equations (11) and (12) gives a probability of failure
$$\Pr\left[\sup_{\boldsymbol{\xi}\in\mathcal{X}^+}|\bar{g}(\boldsymbol{\xi})| \geq \epsilon\right] \leq 2\left(\frac{4\Delta}{R}\right)^{2l}\exp\left(-\frac{m\epsilon^2}{8\kappa^2}\right) + L^2\kappa^2\left(\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2\right)\left(\frac{2R}{\epsilon}\right)^2 = aR^{-2l} + bR^2. \quad (13)$$
Setting $R = (a/b)^{\frac{1}{2l+2}}$, the bound (13) takes the form $2a^{\frac{2}{2l+2}}b^{\frac{2l}{2l+2}}$. Thus the probability of failure satisfies
$$\Pr\left[\sup_{\boldsymbol{\xi}\in\mathcal{X}^+}|\bar{g}(\boldsymbol{\xi})| \geq \epsilon\right] \leq 2a^{\frac{2}{2l+2}}b^{\frac{2l}{2l+2}} = 2 \cdot 2^{\frac{2}{2l+2}}\left(\frac{8\kappa L\Delta\sqrt{\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2}}{\epsilon}\right)^{\frac{4l}{2l+2}}\exp\left(-\frac{m\epsilon^2}{4(2l + 2)\kappa^2}\right) \leq 2\left(\frac{8\kappa L\Delta\sqrt{\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2}}{\epsilon}\right)^2\exp\left(-\frac{m\epsilon^2}{4(2l + 2)\kappa^2}\right)$$
for all $l \in \mathbb{N}$, assuming $\Delta\kappa L\sqrt{\mathbb{E}\|\mathbf{w}\|^2 + 3(\mathbb{E}\|\mathbf{w}\|)^2} > \epsilon$. Considering the complementary event concludes the proof.

References

[1] Kirthevasan Kandasamy and Yaoliang Yu. Additive Approximations in High Dimensional Nonparametric Regression via the SALSA. In International Conference on Machine Learning, pages 69–78, June 2016.
[2] Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
[3] Alex J. Smola, Zoltán L. Óvári, and Robert C. Williamson. Regularization with Dot-Product Kernels. In Advances in Neural Information Processing Systems 13, pages 308–314. MIT Press, 2001.
[4] Francis Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
[5] Youngmin Cho and Lawrence K. Saul. Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems 22, pages 342–350. Curran Associates, Inc., 2009.
[6] Youngmin Cho and Lawrence K. Saul. Analysis and Extension of Arc-Cosine Kernels for Large Margin Classification. arXiv:1112.3712 [cs], December 2011.
[7] Francis Bach. On the Equivalence Between Kernel Quadrature Rules and Random Feature Expansions. Journal of Machine Learning Research, 18(1):714–751, January 2017.
[8] Dougal J. Sutherland and Jeff Schneider. On the Error of Random Fourier Features. arXiv:1506.02785 [cs, stat], June 2015.
[9] Bharath Sriperumbudur and Zoltan Szabo. Optimal Rates for Random Fourier Features. In Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc., 2015.
[10] Alessandro Rudi and Lorenzo Rosasco. Generalization Properties of Learning with Random Features. In Advances in Neural Information Processing Systems 30, pages 3215–3225. Curran Associates, Inc., 2017.
[11] Yitong Sun, Anna Gilbert, and Ambuj Tewari. But How Does It Work in Theory? Linear SVM with Random Features. arXiv:1809.04481 [cs, stat], September 2018.