Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods


Authors: Majid Janzamin, Hanie Sedghi, Anima Anandkumar

Majid Janzamin∗, Hanie Sedghi†, Anima Anandkumar‡

Abstract

Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.

Keywords: neural networks, risk bound, method-of-moments, tensor decomposition

1 Introduction

Neural networks have revolutionized performance across multiple domains such as computer vision and speech recognition. They are flexible models trained to approximate any arbitrary target function, e.g., the label function for classification tasks. They are composed of multiple layers of neurons or activating functions, which are applied recursively on the input data in order to predict the output. While neural networks have been extensively employed in practice, a complete theoretical understanding is currently lacking.
Training a neural network can be framed as an optimization problem, where the network parameters are chosen to minimize a given loss function, e.g., the quadratic loss function over the error in predicting the output. The performance of training algorithms is typically measured through the notion of risk, which is the expected loss function over unseen test data. A natural question to ask is the hardness of training a neural network with a bounded risk. The findings are mostly negative (Rojas, 1996; Šíma, 2002; Blum and Rivest, 1993; Bartlett and Ben-David, 1999; Kuhlmann, 2000). Training even a simple network is NP-hard, e.g., a network with a single neuron (Šíma, 2002).

∗University of California, Irvine. Email: mjanzami@uci.edu
†University of California, Irvine. Email: sedghih@uci.edu
‡University of California, Irvine. Email: a.anandkumar@uci.edu

The computational hardness of training is due to the non-convexity of the loss function. In general, the loss function has many critical points, which include spurious local optima and saddle points. In addition, we face the curse of dimensionality: the number of critical points grows exponentially with the input dimension for general non-convex problems (Dauphin et al., 2014). Popular local search methods such as gradient descent or backpropagation can get stuck in bad local optima and experience arbitrarily slow convergence. Explicit examples of such failures, and the presence of bad local optima even in simple separable settings, have been documented before (Brady et al., 1989; Gori and Tesi, 1992; Frasconi et al., 1997); see Section 6.1 for a discussion.
Alternative methods for training neural networks have been mostly limited to specific activation functions (e.g., linear or quadratic), specific target functions (e.g., polynomials) (Andoni et al., 2014), or strong assumptions on the input (e.g., Gaussian or product distributions) (Andoni et al., 2014); see related work for details. Thus, up until now, there has been no unified framework for training networks with general input, output, and activation functions for which we can provide a guaranteed risk bound.

In this paper, for the first time, we present a guaranteed framework for learning general target functions using neural networks, and simultaneously overcome computational, statistical, and approximation challenges. In other words, our method has low computational and sample complexity even as the dimension of the optimization grows, and in addition it can handle approximation errors when the target function may not be generated by a given neural network. We prove a guaranteed risk bound for our proposed method. NP-hardness refers to the computational complexity of training worst-case instances; instead, we provide transparent conditions on the target functions and the inputs for tractable learning.

Our training method is based on the method of moments, which involves decomposing the empirical cross moment between the output and some function of the input. While pairwise moments are represented using a matrix, higher order moments require tensors, and the learning problem can be formulated as tensor decomposition. A CP (CanDecomp/Parafac) decomposition of a tensor involves finding a succinct sum of rank-one components that best fit the input tensor.
Even though it is a non-convex problem, the global optimum of tensor decomposition can be achieved using computationally efficient techniques, under a set of mild non-degeneracy conditions (Anandkumar et al., 2014a,b,c, 2015; Bhaskara et al., 2013). These methods have recently been employed for learning a wide range of latent variable models (Anandkumar et al., 2014b, 2013).

Incorporating tensor methods for training neural networks requires addressing a number of non-trivial questions: What form of moments are informative about the network parameters? Earlier works using tensor methods for learning assume a linear relationship between the hidden and observed variables, whereas neural networks possess non-linear activation functions; how do we adapt tensor methods to this setting? How do these methods behave in the presence of approximation and sample perturbations? How can we establish risk bounds? We address these questions shortly.

1.1 Summary of Results

The main contributions are: (a) we propose an efficient algorithm for training neural networks, termed Neural Network-LearnIng using Feature Tensors (NN-LIFT); (b) we demonstrate that the method is embarrassingly parallel and is competitive with standard SGD in terms of computational complexity; and, as the main result, (c) we establish that it has bounded risk when the number of training samples scales polynomially in relevant parameters such as input dimension and number of neurons.

We analyze training of a two-layer feedforward neural network, where the second layer has a linear activation function. This is the classical neural network considered in a number of works (Cybenko, 1989b; Hornik, 1991; Barron, 1994), and a natural starting point for the analysis of any learning algorithm.
Note that training even this two-layer network is non-convex, and finding a computationally efficient method with a guaranteed risk bound has been an open problem up until now.

At a high level, NN-LIFT estimates the weights of the first layer using tensor CP decomposition. It then uses these estimates to learn the bias parameter of the first layer using a simple Fourier technique, and finally estimates the parameters of the last layer using linear regression. NN-LIFT consists of simple linear and multi-linear operations (Anandkumar et al., 2014b,c, 2015), Fourier analysis, and ridge regression, all of which are parallelizable to large-scale data sets. The computational complexity is comparable to that of standard SGD; in fact, the parallel time complexity of both methods is of the same order, and our method requires more processors than SGD only by a multiplicative factor that scales linearly in the input dimension.

Generative vs. discriminative models: Generative models incorporate a joint distribution p(x, y) over both the input x and label y. On the other hand, discriminative models such as neural networks only incorporate the conditional distribution p(y|x). While training neural networks for general input x is NP-hard, does knowledge about the input distribution p(x) make learning tractable? In this work, we assume knowledge of the input density p(x), which can be any continuous differentiable function. Unlike many theoretical works, e.g., (Andoni et al., 2014), we do not limit ourselves to distributions such as product or Gaussian distributions for the input. While unsupervised learning, i.e., estimation of the density p(x), is itself a hard problem for general models, in this work we investigate how p(x) can be exploited to make training of neural networks tractable.
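The first NN-LIFT stage described above, CP decomposition of a third-order moment tensor via tensor power iteration, can be sketched in a few lines. This is a minimal illustration under simplifying assumptions the paper does not make: the first-layer weight columns are taken orthonormal (so plain power iteration works without whitening), the exact population moment tensor is used instead of an empirical estimate, and the coefficients `lam` stand in for the products of second-layer weights and expected third derivatives of the activation. The helper `power_iteration` and all parameter values are illustrative, not from the paper; the Fourier bias-recovery and regression stages are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

# Hypothetical ground truth: orthonormal first-layer weight columns w_1..w_k
# (orthonormality is a simplification that avoids the whitening step).
W, _ = np.linalg.qr(rng.standard_normal((d, k)))
lam = np.array([3.0, 2.0, 1.5])  # stand-ins for a_2(j) * E[sigma'''(<w_j, x>)]

# Population cross-moment tensor T = sum_j lam_j * w_j ⊗ w_j ⊗ w_j,
# the rank-k CP structure the cross moment takes under the model.
T = np.einsum('j,aj,bj,cj->abc', lam, W, W, W)

def power_iteration(T, n_iter=100):
    """Recover one rank-1 component of a symmetric tensor by power iteration."""
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('abc,b,c->a', T, v, v)   # tensor-vector-vector product
        v /= np.linalg.norm(v)
    lam_hat = np.einsum('abc,a,b,c->', T, v, v, v)
    return lam_hat, v

# Deflation: peel off one rank-1 component at a time.
est = []
R = T.copy()
for _ in range(k):
    lam_hat, v = power_iteration(R)
    est.append((lam_hat, v))
    R = R - lam_hat * np.einsum('a,b,c->abc', v, v, v)

for lam_hat, v in est:
    # each recovered v should align with some true column, up to ordering
    print(round(float(lam_hat), 3), round(float(np.max(np.abs(W.T @ v))), 3))
```

For non-orthogonal (but full-column-rank) components, a whitening transformation is applied first, which is where the singular-value conditions on A_1 in the results below enter.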
The knowledge of p(x) is naturally available in the experimental design framework, where the person designing the experiments has the ability to choose the input distribution. Examples include conducting polling, carrying out drug trials, collecting survey information, and so on.

Utilizing generative models on the input via score functions: We utilize the knowledge about the input density p(x) (up to normalization)¹ to obtain certain (non-linear) transformations of the input, given by the class of score functions. Score functions are normalized derivatives of the input pdf; see (8). If the input is a vector (the typical case), the first order score function (i.e., the first derivative) is a vector, the second order score is a matrix, and the higher order scores are tensors. In our NN-LIFT method, we first estimate the cross-moment between the output and the input score functions, and then decompose it into rank-1 components.

Risk bounds: The risk bound includes both approximation and estimation errors. The approximation error is the error in fitting the target function to a neural network of the given architecture, and the estimation error is the error in estimating the weights of that neural network using the given samples.

We first consider the realizable setting where the target function is generated by a two-layer neural network (with a hidden layer of neurons consisting of any general sigmoidal activations) and a linear output layer. Note that the approximation error is zero in this setting. Let A_1 ∈ R^{d×k} be the weight matrix of the first layer (connecting the input to the neurons), with k denoting the number of neurons and d denoting the input dimension.

¹We do not require knowledge of the normalizing constant or the partition function, which is #P-hard to compute (Wainwright and Jordan, 2008).
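The score functions described above have simple closed forms when the input is a standard Gaussian, the running example used below. The sketch assumes the standard definition S_m(x) = (−1)^m ∇^(m) p(x) / p(x) (the paper's definition (8) is not shown in this excerpt), and the helper names S1, S2, S3 are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

def p(x):
    """Standard Gaussian density N(0, I_d), used here as the assumed input pdf."""
    return np.exp(-0.5 * x @ x) / (2 * np.pi) ** (d / 2)

# Score functions S_m(x) = (-1)^m * grad^(m) p(x) / p(x); for the standard
# Gaussian these reduce to the Hermite-polynomial forms below.
def S1(x):
    return x                            # first order score: a vector

def S2(x):
    return np.outer(x, x) - np.eye(d)   # second order score: a matrix

def S3(x):
    I = np.eye(d)
    T = np.einsum('a,b,c->abc', x, x, x)
    T -= np.einsum('a,bc->abc', x, I)
    T -= np.einsum('b,ac->abc', x, I)
    T -= np.einsum('c,ab->abc', x, I)
    return T                            # third order score: a tensor

# Sanity-check S1 against a finite-difference gradient of p.
x0 = rng.standard_normal(d)
h = 1e-5
grad = np.array([(p(x0 + h * e) - p(x0 - h * e)) / (2 * h) for e in np.eye(d)])
print(np.max(np.abs(-grad / p(x0) - S1(x0))))   # ~ 0
```

Note that the higher order scores are fully symmetric tensors, which is what makes the cross moment with the output amenable to symmetric CP decomposition.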
Suppose these weight vectors are non-degenerate, i.e., the weight matrix A_1 (or its tensorization) is full column rank. We assume a continuous input distribution with access to score functions, which are bounded on any set of non-zero measure. We allow for any general sigmoidal activation functions with non-zero third derivatives in expectation, and satisfying the Lipschitz property. Let s_min(·) be the minimum singular value operator, and let M_3(x) ∈ R^{d×d²} denote the matricization of the input score function tensor S_3(x) ∈ R^{d×d×d}; see (1) and (8) for the definitions. For Gaussian input x ∼ N(0, I_d), we have E[‖M_3(x) M_3^⊤(x)‖] = Õ(d³).

We have the following learning result in the realizable setting, where the target function is generated by a two-layer neural network (with one hidden layer).

Theorem 1 (Informal result for realizable setting). Our method NN-LIFT learns a realizable target function up to error ε when the number of samples is lower bounded as²

n ≥ Õ( (k/ε²) · E[‖M_3(x) M_3^⊤(x)‖] · s_max²(A_1) / s_min⁶(A_1) ).

Thus, we can efficiently learn the neural network parameters with polynomial sample complexity using the NN-LIFT algorithm. In addition, the method has polynomial computational complexity, and in fact its parallel time complexity is the same as that of stochastic gradient descent (SGD) or backpropagation. See Theorem 3 for the formal result.

We then extend our results to the non-realizable setting, where the target function need not be generated by a neural network. For our method NN-LIFT to succeed, we require the approximation error to be sufficiently small under the given network architecture.
Note that it is not of practical interest to consider functions with large approximation errors, since classification performance in that case is poor (Bartlett, 1998). We state the informal version of the result as follows. We assume that the target function f(x) has a continuous Fourier spectrum and is sufficiently smooth, i.e., the parameter C_f (see (13) for the definition) is sufficiently small as specified in (18). This implies that the approximation error of the target function can be controlled, i.e., there exists a neural network of the given size that can fit the target function with bounded approximation error. Let the input x be bounded as ‖x‖ ≤ r. Our informal result is as follows; see Theorem 5 for the formal result.

Theorem 2 (Informal result for non-realizable setting). The arbitrary target function f(x) is approximated by the neural network f̂(x) learnt using the NN-LIFT algorithm such that the risk bound satisfies w.h.p.

E_x[|f(x) − f̂(x)|²] ≤ O(r² C_f²) · (1/√k + δ_1)² + O(ε²),

where k is the number of neurons in the neural network, and δ_τ is defined in (16).

In the above bound, we require the target function f(x) to have a bounded first order moment in the Fourier spectrum; see (18). As an example, we show that this bound is satisfied for the class of scale and location mixtures of the Gaussian kernel function.

²Here, only the dominant terms in the sample complexity are noted; see (12) for the full details.

Corollary 1 (Learning mixtures of Gaussian kernels).
Let f(x) := ∫ K(α(x + β)) G(dα, dβ), α > 0, β ∈ R^d, be a location and scale mixture of the Gaussian kernel function K(x) = exp(−‖x‖²/2), let the input be Gaussian as x ∼ N(0, σ_x² I_d), and let the activations be step functions. Then our algorithm trains a neural network with risk bounds as in Theorem 2, when

∫ |α| · |G|(dα, dβ) ≤ poly(1/d, 1/k, ε, 1/σ_x, exp(−1/σ_x²)).

We observe that when the kernel mixtures correspond to smoother functions (smaller α), the above bound is more likely to be satisfied. This is intuitive, since smoother functions have a lower amount of high frequency content. Also notice that the above bound depends on the variance σ_x of the Gaussian input. We obtain the most relaxed bound (the r.h.s. of the above bound) for middle values of σ_x, i.e., when σ_x is neither too large nor too small. See Appendix D.1 for more discussion and the proof of the corollary.

Intuitions behind the conditions for the risk bound: Since there exist worst-case instances where learning is hard, it is natural to expect that NN-LIFT has guarantees only when certain conditions are met. We assume that the input has a regular continuous probability density function (pdf); see (11) for the details. This is a reasonable assumption, since under Boolean inputs (a special case of discrete input), the problem reduces to learning parity with noise, which is a hard problem (Kearns and Vazirani, 1994). We assume that the activating functions are sufficiently non-linear, since if they are linear, then the network can be collapsed into a single layer (Baldi and Hornik, 1989), which is non-identifiable. We precisely characterize how the estimation error depends on the non-linearity of the activating function through its third order derivative.
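The function class in Corollary 1 can be made concrete by taking a discrete mixing measure G, under which the integral defining f(x) becomes a finite sum. In the sketch below, the specific weights, scales, and locations are arbitrary illustrative choices, and the helper names K and f are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 2

def K(x):
    """Gaussian kernel K(x) = exp(-||x||^2 / 2)."""
    return np.exp(-0.5 * np.sum(x * x, axis=-1))

# A discrete location-scale mixing measure G: atoms (alpha_i, beta_i)
# with weights g_i. Smaller alpha means a smoother component.
alphas = np.array([0.5, 1.0, 2.0])      # scales
betas = rng.standard_normal((3, d))     # locations
g = np.array([0.5, 0.3, 0.2])           # mixing weights

def f(x):
    """Discrete instance of f(x) = ∫ K(alpha (x + beta)) G(dalpha, dbeta)."""
    return sum(gi * K(ai * (x + bi)) for gi, ai, bi in zip(g, alphas, betas))

# The smoothness quantity in the corollary, ∫ |alpha| |G|(dalpha, dbeta),
# reduces to a weighted sum of the scales for a discrete mixture.
smoothness = np.sum(g * alphas)
print(float(f(np.zeros(d))), float(smoothness))
```

Shrinking the scales alphas shrinks the smoothness quantity, matching the observation above that smoother mixtures satisfy the corollary's bound more easily.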
Another condition for providing the risk bound is non-redundancy of the neurons. If the neurons are redundant, the network is over-specified. In the realizable setting, where the target function is generated by a neural network with the given number of neurons k, we require (tensorizations of) the weights of the first layer to be linearly independent. In the non-realizable setting, we require this to be satisfied by k vectors randomly drawn from the Fourier magnitude distribution (weighted by the norm of the frequency vector) of the target function f(x). More precisely, the random frequencies are drawn from the probability distribution Λ(ω) := ‖ω‖ · |F(ω)| / C_f, where F(ω) is the Fourier transform of the arbitrary function f(x), and C_f is the normalization factor; see (26) and the corresponding discussions for more details. This is a mild condition, which holds when the distribution is continuous in some domain. Thus, our conditions for achieving bounded risk are mild and encompass a large class of target functions and input distributions.

Why are tensors required? We employ the cross-moment tensor, which encodes the correlation between the third order score function and the output. We then decompose the moment tensor as a sum of rank-1 components to yield the weight vectors of the first layer. We require at least a third order tensor to learn the neural network weights for the following reason: while a matrix decomposition is only identifiable up to orthogonal components, tensors can have identifiable non-orthogonal components. In general, it is not realistic to assume that the weight vectors are orthogonal, and hence we require tensors to learn the weight vectors. Moreover, through tensors, we can learn overcomplete networks, where the number of hidden neurons can exceed the input/output dimensions.
Note that matrix factorization methods are unable to learn overcomplete models, since the rank of a matrix cannot exceed its dimensions. Thus, it is critical to incorporate tensors for training neural networks. A recent set of papers has analyzed tensor methods in detail and established convergence and perturbation guarantees (Anandkumar et al., 2014a,b,c, 2015; Bhaskara et al., 2013), despite the non-convexity of the decomposition problem. Such strong theoretical guarantees are essential for deriving provable risk bounds for NN-LIFT.

Extensions: Our algorithm NN-LIFT can be extended to more layers by recursively estimating the weights layer by layer. In principle, our analysis can be extended by controlling the perturbation introduced by layer-by-layer estimation. Establishing precise guarantees is an exciting open problem.

In this work, we assume knowledge of the generative model for the input. As argued before, in many settings such as experimental design or polling, the design of the input pdf p(x) is under the control of the learner. Even if p(x) is not known, a recent flurry of research activity has shown that a wide class of probabilistic models can be trained consistently using a suite of different efficient algorithms: convex relaxation methods (Chandrasekaran et al., 2010), spectral and tensor methods (Anandkumar et al., 2014b), and alternating minimization (Agarwal et al., 2014); these require only polynomial sample and computational complexity with respect to the input and hidden dimensions. These methods can learn a rich class of models, which also includes latent or hidden variable models.

Another aspect not addressed in this work is the issue of regularization for our NN-LIFT algorithm. Here, we assume that the number of neurons is chosen appropriately to balance bias and variance through cross validation.
Designing implicit regularization methods such as dropout (Hinton et al., 2012) or early stopping (Morgan and Bourlard, 1989) for tensor factorization, and analyzing them rigorously, is another exciting open research problem.

1.2 Related works

We first review some works regarding the analysis of backpropagation, and then discuss theoretical results on training neural networks.

Analysis of backpropagation and the loss surface of optimization: Baldi and Hornik (1989) show that if the activations are linear, then backpropagation has a unique local optimum, and it corresponds to the principal components of the covariance matrix of the training examples. However, it is known that there exist networks with non-linear activations where backpropagation fails; for instance, Brady et al. (1989) construct simple cases of linearly separable classes where backpropagation fails. Note that the simple perceptron algorithm succeeds here due to linear separability. Gori and Tesi (1992) argue that such examples are artificial and that backpropagation reaches the global optimum for linearly separable classes in practical settings. However, they show that under non-linear separability, backpropagation can get stuck in local optima. For a detailed survey, see Frasconi et al. (1997).

Recently, Choromanska et al. (2015) analyze the loss surface of a multi-layer ReLU network by relating it to a spin glass system. They make several assumptions, such as variable independence of the input, equally likely paths from input to output, redundancy in the network parameterization, and a uniform distribution for unique weights, which are far from realistic.
Under these assumptions, the network reduces to a random spin glass model, where it is known that the lowest critical values of the random loss function form a layered structure, and the number of local minima outside that band diminishes exponentially with the network size (Auffinger et al., 2013). However, this does not imply computational efficiency: there is no guarantee that we can find such a good local optimum using computationally cheap algorithms, since there is still an exponential number of such points.

Haeffele and Vidal (2015) provide a general framework for characterizing when local optima become global in deep learning and other scenarios. The idea is that if the network is sufficiently over-specified (i.e., has enough hidden neurons) such that there exist local optima where some of the neurons have zero contribution, then such local optima are in fact global. This provides a simple and unified characterization of local optima which are global. However, in general, it is not clear how to design algorithms that can reach these efficient optimal points.

Previous theoretical works for training neural networks: Analysis of risk for neural networks is a classical problem. The approximation error of two-layer neural networks has been analyzed in a number of works (Cybenko, 1989b; Hornik, 1991; Barron, 1994). Barron (1994) provides a bound on the approximation error and combines it with the estimation error to obtain a risk bound, but for a computationally inefficient method. The sample complexity of neural networks has been extensively analyzed in Barron (1994); Bartlett (1998), assuming convergence to the globally optimal solution, which in general is intractable. See Anthony and Bartlett (2009); Shalev-Shwartz and Ben-David (2014) for an exposition of classical results on neural networks. Andoni et al.
(2014) learn polynomial target functions using a two-layer neural network under Gaussian/uniform input distributions. They argue that the weights of the first layer can be selected randomly, and only the second layer weights, which are linear, need to be fitted optimally. However, in practice, Gaussian/uniform distributions are never encountered in classification problems, and for general distributions, random weights in the first layer are not sufficient. Under our framework, we impose only mild non-degeneracy conditions on the weights.

Livni et al. (2014) make the observation that networks with quadratic activation functions can be trained efficiently in an incremental manner: with quadratic activations, greedily adding one neuron at a time can be solved efficiently through eigen-decomposition. However, approximating standard sigmoidal networks in this framework requires a polynomial network of large depth, which is not practical. After we posted the initial version of this paper, Zhang et al. (2015) extended this framework to the improper learning scenario, where the output predictor need not be a neural network. They show that if the ℓ_1 norm of the incoming weights in each layer is bounded, then learning is efficient. However, for the usual neural networks with sigmoidal activations, the ℓ_1 norm of the weights scales with the dimension, and in this case the algorithm is no longer polynomial time.

Arora et al. (2013) provide bounds for learning a class of deep representations. They use layer-wise learning, where the neural network is learned layer-by-layer in an unsupervised manner. They assume sparse edges with random bounded weights, and 0/1 threshold functions at the hidden nodes.
The difference is that here we consider the supervised setting, where there is both input and output, and we allow for general sigmoidal functions at the hidden neurons.

Recently, after the posting of the initial version of this paper, Hardt et al. (2015) provided an analysis of stochastic gradient descent and its generalization error in convex and non-convex problems such as training neural networks. They show that the generalization error can be controlled under mild conditions. However, their work does not address reaching a solution with a small risk bound using SGD, and SGD in general can get stuck in spurious local optima. On the other hand, we show that in addition to having a small generalization error, our method yields a neural network with a small risk bound. Note that our method is a moment-based estimator, and such methods come with stability bounds that guarantee good generalization error.

Closely related to this paper, Sedghi and Anandkumar (2014) consider learning neural networks with sparse connectivity. They employ the cross-moment between the (multi-class) label and the (first order) score function of the input. They show that the weights of the first layer can be provably learned, as long as the weights are sparse enough and there are enough input dimensions and output classes (at least linear, up to log factors, in the number of neurons in any layer). In this paper, we remove these restrictions and allow the output to be just a binary class (indeed, our framework applies to the multi-class setting as well, since the amount of information only increases with more label classes from the algorithmic perspective), and we allow the number of neurons to exceed the input/output dimensions (the overcomplete setting).
Moreover, we extend beyond the realizable setting, and do not require the target functions to be generated from the class of neural networks under consideration.

2 Preliminaries and Problem Formulation

We first introduce some notation and then state the problem formulation.

2.1 Notation

Let [n] := {1, 2, ..., n}, let ‖u‖ denote the ℓ_2 or Euclidean norm of a vector u, and let ⟨u, v⟩ denote the inner product of vectors u and v. For a matrix C ∈ R^{d×k}, the j-th column is referred to by C_j or c_j, j ∈ [k]. Throughout this paper, ∇_x^{(m)} denotes the m-th order derivative operator w.r.t. variable x. For matrices A, B ∈ R^{d×k}, the Khatri-Rao product C := A ⊙ B ∈ R^{d²×k} is defined such that C(l + (i−1)d, j) = A(i, j) · B(l, j), for i, l ∈ [d], j ∈ [k].

Tensor: A real m-th order tensor T ∈ ⊗^m R^d is a member of the outer product of m Euclidean spaces R^d. The different dimensions of the tensor are referred to as modes. For instance, for a matrix, the first mode refers to columns and the second mode refers to rows.

Tensor matricization: For a third order tensor T ∈ R^{d×d×d}, the matricized version along the first mode, denoted by M ∈ R^{d×d²}, is defined such that

T(i, j, l) = M(i, l + (j−1)d), i, j, l ∈ [d]. (1)

Tensor rank: A third order tensor T ∈ R^{d×d×d} is said to be rank-1 if it can be written in the form

T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l), (2)

where ⊗ represents the outer product, and a, b, c ∈ R^d are unit vectors. A tensor T ∈ R^{d×d×d} is said to have CP (CanDecomp/Parafac) rank k if it can be (minimally) written as the sum of k rank-1 tensors,

T = Σ_{i∈[k]} w_i · a_i ⊗ b_i ⊗ c_i, w_i ∈ R, a_i, b_i, c_i ∈ R^d. (3)

Derivative: For a function g(x): R^d → R with vector input x ∈ R^d, the m-th order derivative w.r.t.
variable x is denoted by ∇_x^{(m)} g(x) ∈ ⊗^m R^d (which is an m-th order tensor), such that

[∇_x^{(m)} g(x)]_{i_1,...,i_m} := ∂g(x) / (∂x_{i_1} ∂x_{i_2} ··· ∂x_{i_m}), i_1, ..., i_m ∈ [d]. (4)

When it is clear from the context, we drop the subscript x and write the derivative as ∇^{(m)} g(x).

Fourier transform: For a function f(x): R^d → R, the multivariate Fourier transform F(ω): R^d → R is defined as

F(ω) := ∫_{R^d} f(x) e^{−j⟨ω,x⟩} dx, (5)

where the variable ω ∈ R^d is called the frequency variable, and j denotes the imaginary unit. We also write f(x) ↔ F(ω) for the Fourier pair (f(x), F(ω)).

Function notation: Throughout the paper, we use the following convention to distinguish different types of functions. We use f(x) (or y) to denote an arbitrary function, and f̃(x) (or ỹ) to denote the output of a realizable neural network; this helps us differentiate between the two. We also use the notation f̂(x) (or ŷ) to denote the estimated (trained) neural network obtained from a finite number of samples.

2.2 Problem formulation

We now introduce the problem of training a neural network in the realizable and non-realizable settings, and elaborate on the notion of risk bound, i.e., how well the trained neural network approximates an arbitrary function. It is known that continuous functions with compact domain can be arbitrarily well approximated by feedforward neural networks with one hidden layer and sigmoidal nonlinear functions (Cybenko, 1989a; Hornik et al., 1989; Barron, 1993).

The input (feature) is denoted by variable x, and the output (label) is denoted by variable y. We assume the input and output are generated according to some joint density function p(x, y) such that (x_i, y_i) ∼ p(x, y) i.i.d., where (x_i, y_i) denotes the i-th sample.
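The tensor notation of Section 2.1 (matricization, Khatri-Rao product, CP rank) can be sanity-checked numerically. The sketch below uses zero-based indices in place of the paper's one-based ones, and the helper names `khatri_rao` and `matricize_mode1` are illustrative; it verifies the standard identity that the mode-1 unfolding of the CP form (3) factors as A · diag(w) · (B ⊙ C)^⊤, which links definitions (1) and (3).

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 3

def khatri_rao(A, B):
    """Column-wise Kronecker product: (A ⊙ B)[l + i*d2, j] = A[i, j] * B[l, j]."""
    d1, kk = A.shape
    d2, _ = B.shape
    return np.einsum('ij,lj->ilj', A, B).reshape(d1 * d2, kk)

def matricize_mode1(T):
    """Mode-1 unfolding, eq. (1): M[i, j*d + l] = T[i, j, l] (zero-based)."""
    dd = T.shape[0]
    return T.reshape(dd, dd * dd)

# A CP-rank-k tensor T = sum_r w_r * a_r ⊗ b_r ⊗ c_r, as in (3).
A, B, C = rng.standard_normal((3, d, k))
w = np.array([1.0, -2.0, 0.5])
T = np.einsum('r,ir,jr,lr->ijl', w, A, B, C)

# The unfolding factors through the Khatri-Rao product.
M = matricize_mode1(T)
M_factored = A @ np.diag(w) @ khatri_rao(B, C).T
print(np.allclose(M, M_factored))  # True
```

This factorization is exactly why matricized score-function tensors such as M_3(x) appear in the sample-complexity bounds: spectral quantities of the unfolding control the conditioning of the CP decomposition step.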
We assume knowledge of the input density p(x), and demonstrate how it can be used to train a neural network to approximate the conditional density p(y|x) in a computationally efficient manner. We discuss in Section 3.1 how the input density p(x) can be estimated through numerous methods such as score matching or spectral methods. In settings such as experimental design, the input density p(x) is known to the learner since she designs the density function, and our framework is directly applicable there. In addition, we do not need to know the normalization factor or partition function of the input distribution p(x); estimation up to the normalization factor suffices.

Risk bound: In this paper, we propose a new algorithm for training neural networks and provide risk bounds with respect to an arbitrary target function. The risk is the expected loss over the joint probability density function of input x and output y. Here, we consider the squared ℓ2 loss and bound the risk error

E_x[|f(x) − f̂(x)|²], (6)

where f(x) is an arbitrary function which we want to approximate by f̂(x), the estimated (trained) neural network. This notion of risk is also called the mean integrated squared error.

[Figure 1: Graphical representation of a neural network, E[ỹ|x] = A_2^⊤ σ(A_1^⊤ x + b_1) + b_2. Note that this representation is for a general vector output ỹ; it can also be written as E[ỹ|x] = ⟨a_2, σ(A_1^⊤ x + b_1)⟩ + b_2 in the case of scalar output ỹ.]

The risk error for a neural network consists of two parts: approximation error and estimation error.
Approximation error is the error in fitting the target function f(x) with a neural network of the given architecture, f̃(x), and estimation error is the error in training that network with a finite number of samples, yielding f̂(x). Thus, the risk error measures the ability of the trained neural network to generalize to new data generated by the function f(x). We now introduce the realizable and non-realizable settings, which elaborate on these sources of error.

2.2.1 Realizable setting

In the realizable setting, the output is generated by a neural network. We consider a neural network with one hidden layer of dimension k. Let the output ỹ ∈ {0, 1} be the binary label, and x ∈ R^d be the feature vector; see Section 6.2 for the generalization to higher dimensional output (multi-label and multi-class), and also the continuous output case. We consider the label generating model

f̃(x) := E[ỹ|x] = ⟨a_2, σ(A_1^⊤ x + b_1)⟩ + b_2, (7)

where σ(·) is a (linear or nonlinear) elementwise function. See Figure 1 for a schematic representation of the label function in (7) in the general case of vector output ỹ.

In the realizable setting, the goal is to train the neural network in (7), i.e., to learn the weight matrices (vectors) A_1 ∈ R^{d×k}, a_2 ∈ R^k and bias vectors b_1 ∈ R^k, b_2 ∈ R. This involves only the estimation analysis: we have a label function f̃(x) specified in (7) with fixed unknown parameters A_1, b_1, a_2, b_2, and the goal is to learn these parameters and finally bound the overall function estimation error E_x[|f̃(x) − f̂(x)|²], where f̂(x) is the estimate of the fixed neural network f̃(x) given finite samples. For this task, we propose a computationally efficient algorithm which requires only a polynomial number of samples for bounded estimation error.
This is the first time that such a result has been established for any neural network.

Algorithm 1 NN-LIFT (Neural Network LearnIng using Feature Tensors)
input: Labeled samples {(x_i, y_i) : i ∈ [n]}, parameter ε̃_1, parameter λ.
input: Third order score function S_3(x) of the input x; see Equation (8) for the definition.
1: Compute T̂ := (1/n) Σ_{i∈[n]} y_i · S_3(x_i).
2: {(Â_1)_j}_{j∈[k]} = tensor decomposition(T̂); see Section 3.2 and Appendix B for details.
3: b̂_1 = Fourier method({(x_i, y_i) : i ∈ [n]}, Â_1, ε̃_1); see Procedure 2.
4: (â_2, b̂_2) = Ridge regression({(x_i, y_i) : i ∈ [n]}, Â_1, b̂_1, λ); see Procedure 3.
5: return Â_1, â_2, b̂_1, b̂_2.

2.2.2 Non-realizable setting

In the non-realizable setting, the output is an arbitrary function f(x) which is not necessarily a neural network. We want to approximate f(x) by f̂(x), the estimated (trained) neural network. In this setting, an additional approximation analysis is also required. In this paper, we combine the estimation result in the realizable setting with the approximation bounds in Barron (1993), leading to risk bounds with respect to the target function f(x); see (6) for the definition of the risk. The detailed results are provided in Section 5.

3 NN-LIFT Algorithm

In this section, we introduce our proposed method for learning neural networks using tensor, Fourier and regression techniques. Our method is shown in Algorithm 1, named NN-LIFT (Neural Network LearnIng using Feature Tensors). The algorithm has three main components. The first component estimates the weight matrix of the first layer, denoted by A_1 ∈ R^{d×k}, by a tensor decomposition method. The second component estimates the bias vector of the first layer, b_1 ∈ R^k, by a Fourier method.
We finally estimate the parameters of the last layer, a_2 ∈ R^k and b_2 ∈ R, by linear regression.

Note that most of the unknown parameters (compare the dimensions of matrix A_1 with vectors a_2, b_1 and scalar b_2) are estimated in the first part, and thus this is the main part of the algorithm. Given this fact, we also provide an alternative method for estimating the other parameters of the model, given the estimate of A_1 from the tensor method. It is based on incrementally adding neurons, one by one, whose first layer weights are given by A_1 and whose remaining parameters are updated using brute-force search on a grid. Since each update involves just the corresponding bias term in b_1 and its contribution to the final output, it is low dimensional and can be done efficiently; details are in Section 6.3.

We now explain the steps of Algorithm 1 in more detail.

3.1 Score function

The m-th order score function S_m(x) ∈ ⊗^m R^d is defined as (Janzamin et al., 2014)

S_m(x) := (−1)^m ∇_x^{(m)} p(x) / p(x), (8)

where p(x) is the probability density function of the random vector x ∈ R^d, and ∇_x^{(m)} denotes the m-th order derivative operator; see (4) for the precise definition. The main property of score functions, namely that they yield differential operators enabling us to estimate the weight matrix A_1 via tensor decomposition, is discussed in the next subsection; see Equation (9).

In this paper, we assume access to a sufficiently good approximation of the input pdf p(x) and the corresponding score functions S_2(x), S_3(x). Indeed, estimating these quantities in general is a hard problem, but there exist numerous instances where this becomes tractable.
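For instance, for standard Gaussian input x ~ N(0, I_d), the score functions have simple closed forms (a standard fact, not derived in this section): S_1(x) = x and S_2(x) = x x^⊤ − I. A minimal numpy sketch checking the second-order case of definition (8) with finite differences:

```python
import numpy as np

def gauss_pdf(x):
    # density of the standard Gaussian N(0, I_d)
    d = x.shape[0]
    return np.exp(-0.5 * x @ x) / (2 * np.pi) ** (d / 2)

def hessian_fd(f, x, eps=1e-4):
    # finite-difference Hessian of a scalar function f at x
    d = x.shape[0]
    H = np.zeros((d, d))
    for i in range(d):
        for k in range(d):
            e_i, e_k = np.eye(d)[i] * eps, np.eye(d)[k] * eps
            H[i, k] = (f(x + e_i + e_k) - f(x + e_i - e_k)
                       - f(x - e_i + e_k) + f(x - e_i - e_k)) / (4 * eps**2)
    return H

rng = np.random.default_rng(0)
x = rng.standard_normal(3)

# definition (8) with m = 2: S_2(x) = (-1)^2 * Hessian(p)(x) / p(x)
S2_def = hessian_fd(gauss_pdf, x) / gauss_pdf(x)
# closed form for standard Gaussian input: S_2(x) = x x^T - I
S2_closed = np.outer(x, x) - np.eye(3)
assert np.allclose(S2_def, S2_closed, atol=1e-3)
```

The same pattern extends to S_3(x), whose Gaussian closed form is a degree-3 polynomial in x.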
Examples include spectral methods for learning latent variable models such as Gaussian mixtures, topic or admixture models, independent component analysis (ICA) and so on (Anandkumar et al., 2014b). Moreover, there have been recent advances in non-parametric score matching methods (Sriperumbudur et al., 2013) for density estimation in infinite dimensional exponential families with guaranteed convergence rates. These methods can be used to estimate the input pdf in an unsupervised manner. Below, we discuss score function estimation methods in detail. In this paper, we focus on how the input generative information can be used to make training of neural networks tractable. For simplicity, in the subsequent analysis we assume that these quantities are perfectly known; it is possible to extend the perturbation analysis to take into account the errors in estimating the input pdf; see Remark 4.

Estimation of score function: There are various efficient methods for estimating the score function. The framework of score matching is popular for parameter estimation in probabilistic models (Hyvärinen, 2005; Swersky et al., 2011), where the criterion is to fit parameters by matching the data score function. Swersky et al. (2011) analyze score matching for latent energy-based models. In deep learning, the framework of auto-encoders attempts to find encoding and decoding functions which minimize the reconstruction error under added noise; these are the so-called Denoising Auto-Encoders (DAE). This is an unsupervised framework involving only unlabeled samples. Alain and Bengio (2012) argue that the DAE approximately learns the first order score function of the input, as the noise variance goes to zero. Sriperumbudur et al. (2013) propose non-parametric score matching methods for density estimation in infinite dimensional exponential families with guaranteed convergence rates. Therefore, we can use any of these methods to estimate S_1(x), and use the recursive form (Janzamin et al., 2014)

S_m(x) = −S_{m−1}(x) ⊗ ∇_x log p(x) − ∇_x S_{m−1}(x)

to estimate higher order score functions.

3.2 Tensor decomposition

The score functions are new representations (extracted features) of the input data x that can be used for training neural networks. We use the score functions and the labels of the training data to form the empirical cross-moment T̂ = (1/n) Σ_{i∈[n]} y_i · S_3(x_i), and decompose the tensor T̂ to estimate the columns of A_1. The following discussion reveals why tensor decomposition is relevant for this task.

The score functions have the property of yielding differential operators with respect to the input distribution. More precisely, for the label function f(x) := E[y|x], Janzamin et al. (2014) show that

E[y · S_3(x)] = E[∇_x^{(3)} f(x)].

Now consider the neural network output in (7), and note that the function f̃(x) is a non-linear function of both the input x and the weight matrix A_1. The expectation operator E[·] averages out the dependency on x, and the derivative acts as a linearization operator, as follows. In the neural network output (7), the columns of the weight matrix A_1 are the linear coefficients applied to the input variable x. When taking the derivative of this function, by the chain rule, these linear coefficients show up in the final form.
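The first-order analogue (m = 1) of this differential-operator property, E[y · S_1(x)] = E[∇_x f(x)], is easy to verify by Monte Carlo for standard Gaussian input, where S_1(x) = x. A 1-D sketch (the cubic label function below is our own illustrative choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)

# label function f(x) = x^3; first-order score S_1(x) = x for N(0, 1)
y = x**3
lhs = np.mean(y * x)        # Monte Carlo estimate of E[y * S_1(x)]
rhs = np.mean(3 * x**2)     # Monte Carlo estimate of E[f'(x)]

# both sides estimate E[x^4] = 3
assert abs(lhs - rhs) < 0.1
assert abs(lhs - 3.0) < 0.1
```

The third-order version used by NN-LIFT is the same identity with S_1 replaced by S_3 and the gradient replaced by the third derivative tensor.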
In particular, we show in Lemma 6 (see Section 7) that for the neural network in (7), we have

E[ỹ · S_3(x)] = Σ_{j∈[k]} λ_j · (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j ∈ R^{d×d×d}, (9)

where (A_1)_j ∈ R^d denotes the j-th column of A_1, and λ_j ∈ R denotes the corresponding coefficient; refer to Equation (3) for the notion of tensor rank and its rank-1 components. This clarifies how the score function acts as a linearization operator while the final output is nonlinear in terms of A_1. The above form also clarifies the reason for using tensor decomposition in the learning framework.

Tensor decomposition algorithm: The goal of the tensor decomposition algorithm is to recover the rank-1 components of the tensor. For this step, we use the tensor decomposition algorithm proposed in (Anandkumar et al., 2014b,c); see Appendix B for details. The main step of the tensor decomposition method is the tensor power iteration, which is the generalization of the matrix power iteration to 3rd order tensors. The tensor power iteration is given by

u ← T(I, u, u) / ‖T(I, u, u)‖,

where u ∈ R^d, and T(I, u, u) := Σ_{j,l∈[d]} u_j u_l T(:, j, l) ∈ R^d is a multilinear combination of tensor fibers.³ Convergence guarantees of the tensor power iteration for orthogonal tensor decomposition have been developed in the literature (Zhang and Golub, 2001; Anandkumar et al., 2014b). Note that we first orthogonalize the tensor via a whitening procedure and then apply the tensor power iteration; thus, the original tensor decomposition need not be orthogonal.

Computational complexity: It is popular to perform the tensor decomposition in a stochastic manner, which reduces the computational complexity. This is done by splitting the data into mini-batches.
Starting with the first mini-batch, we perform a small number of tensor power iterations, then use the result as the initialization for the next mini-batch, and so on.

As mentioned earlier, we assume that a sufficiently good approximation of the score function tensor is given to us. For specific cases where we have this tensor in factor form, we can reduce the computational complexity of NN-LIFT by not computing the whole tensor explicitly. By factor form, we mean that the score function tensor can be written as a summation of rank-1 components, which could be a summation over samples, or arise from other structure in the model. We now give a few examples where we have the factor form, and provide the computational complexity. If the input follows a Gaussian distribution, the score function has a simple polynomial form, and the computational complexity of the tensor decomposition is O(nkdR), where n is the number of samples and R is the number of initializations for the tensor decomposition. A similar argument holds when the input distribution is a mixture of Gaussians. We can also analyze the complexity for more complex inputs. If we fit the input data with a Restricted Boltzmann Machine (RBM) model, the computational complexity of our method is O(nkdd_h R), where d_h is the number of neurons in the first layer of the RBM used for approximating the input distribution. Tensor methods are also embarrassingly parallelizable. When performed in parallel, the computational complexity would be O(log(min{d, d_h})) with O(nkdd_h R / log(min{d, d_h})) processors. Alternatively, we can exploit recent tensor sketching approaches (Wang et al., 2015) for computing tensor decompositions efficiently. Wang et al. (2015) build on the idea of count sketches and show that the running time is linear in the input dimension and the number of samples, and is independent of the order of the tensor. Thus, tensor decompositions can be computed efficiently.

[³ Tensor fibers are the vectors obtained by fixing all the indices of the tensor except one, e.g., T(:, j, l) in the above expansion.]

Procedure 2 Fourier method for estimating b_1
input: Labeled samples {(x_i, y_i) : i ∈ [n]}, estimate Â_1, parameter ε̃_1.
input: Probability density function p(x) of the input x.
1: for l = 1 to k do
2:   Let Ω_l := {ω ∈ R^d : ‖ω‖ = 1/2, |⟨ω, (Â_1)_l⟩| ≥ (1 − ε̃_1²/2)/2}, and let |Ω_l| denote the surface area of the (d−1)-dimensional manifold Ω_l.
3:   Draw n i.i.d. random frequencies ω_i, i ∈ [n], uniformly from the set Ω_l.
4:   Let v := (1/n) Σ_{i∈[n]} (y_i / p(x_i)) e^{−j⟨ω_i, x_i⟩}, a complex number written as v = |v| e^{j∠v}; the operators |·| and ∠· denote the magnitude and phase operators, respectively.
5:   Let b̂_1(l) := (1/π)(∠v − ∠Σ(1/2)), where σ(x) ←Fourier→ Σ(ω).
6: end for
7: return b̂_1.

3.3 Fourier method

The second part of the algorithm estimates the first layer bias vector b_1 ∈ R^k. This step is very different from the previous step for estimating A_1, which was based on tensor decomposition methods. This is a Fourier-based method, where complex variables are formed using labeled data and random frequencies in the Fourier domain; see Procedure 2. We prove in Lemma 7 that the entries of b_1 can be estimated from the phase of these complex numbers. We also observe in Lemma 7 that the magnitude of these complex numbers can be used to estimate a_2; this is discussed in Appendix C.2.
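The core of Procedure 2 is the importance-weighted empirical Fourier transform in step 4, which estimates ∫ f̃(x) e^{−j⟨ω,x⟩} dx when the x_i are drawn from the known density p(x). A 1-D numerical sketch of just this estimator (the sigmoid neuron, bias, frequency and uniform density below are hypothetical choices, and the labels are noiseless):

```python
import numpy as np

rng = np.random.default_rng(2)
b, omega = 0.3, 0.5                          # hypothetical bias and frequency
f = lambda x: 1 / (1 + np.exp(-(x + b)))     # a single 1-D sigmoid neuron

# x drawn from a known density p(x): uniform on [-3, 3], so p(x) = 1/6
n = 500_000
x = rng.uniform(-3.0, 3.0, n)
p = 1 / 6

# step 4 of Procedure 2: importance-weighted empirical Fourier transform
v = np.mean(f(x) / p * np.exp(-1j * omega * x))

# it should approximate \int f(x) e^{-j omega x} dx over the support
grid = np.linspace(-3.0, 3.0, 20_001)
dg = grid[1] - grid[0]
v_exact = np.sum(f(grid) * np.exp(-1j * omega * grid)) * dg

assert abs(v - v_exact) < 0.05
```

Procedure 2 then reads the bias off the phase of v; the division by p(x_i) is what turns the empirical average over samples into an (unweighted) integral over x.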
Polynomial-time random draw from the set Ω_l: Note that the random frequencies are drawn from a (d−1)-dimensional manifold Ω_l, which is the intersection of the sphere ‖ω‖ = 1/2 and the cone |⟨ω, (Â_1)_l⟩| ≥ (1 − ε̃_1²/2)/2 in R^d. This manifold is the surface of a spherical cap. In order to draw these frequencies in polynomial time, we consider the d-dimensional spherical coordinate system such that one of the angles is defined based on the cone axis. We can then directly impose the cone constraint by limiting the corresponding angle in the random draw. In addition, Kothari and Meka (2014) propose a method for generating pseudo-random variables from the spherical cap in logarithmic time.

Remark 1 (Knowledge of input distribution only up to normalization factor). The computation of the score function and the Fourier method both involve knowledge of the input pdf p(x). However, we do not need to know its normalization factor, also known as the partition function. For the score function, this is immediate from the definition in (8), since the normalization factor is canceled out by the division by p(x); thus, estimating the score function is at most as hard as estimating the input pdf up to the normalization factor. In the Fourier method, we can use the non-normalized estimate of the input pdf, which leads to a normalization mismatch in the estimation of the corresponding complex number. This is not a problem, since we only use the phase information of these complex numbers.

Procedure 3 Ridge regression method for estimating a_2 and b_2
input: Labeled samples {(x_i, y_i) : i ∈ [n]}, estimates Â_1 and b̂_1, regularization parameter λ.
1: Let ĥ_i := σ(Â_1^⊤ x_i + b̂_1), i ∈ [n], denote the estimates of the neurons.
2: Append each ĥ_i with the dummy variable 1 to represent the bias, so that ĥ_i ∈ R^{k+1}.
3: Let Σ̂_ĥ := (1/n) Σ_{i∈[n]} ĥ_i ĥ_i^⊤ ∈ R^{(k+1)×(k+1)} denote the empirical covariance of ĥ.
4: Let β̂_λ ∈ R^{k+1} denote the parameters estimated by λ-regularized ridge regression as

β̂_λ = (Σ̂_ĥ + λ I_{k+1})^{−1} · (1/n) Σ_{i∈[n]} y_i ĥ_i, (10)

where I_{k+1} denotes the (k+1)-dimensional identity matrix.
5: return â_2 := β̂_λ(1 : k), b̂_2 := β̂_λ(k+1).

3.4 Ridge regression method

For the neural network model in (7), given a good estimate of the neurons, we can estimate the parameters of the last layer by linear regression. We provide Procedure 3, in which we use a ridge regression algorithm to estimate the last layer parameters a_2 and b_2. See Appendix C.3 for the details of ridge regression and the corresponding analysis and guarantees.

4 Risk Bound in the Realizable Setting

In this section, we provide guarantees in the realizable setting, where the function f̃(x) := E[ỹ|x] is generated by a neural network as in (7). We provide the estimation error bound on the overall function recovery, E_x[|f̃(x) − f̂(x)|²], when the estimation is done by Algorithm 1. We provide guarantees in the following settings. 1) In the basic case, we consider the undercomplete regime k ≤ d, and provide the results assuming A_1 is full column rank. 2) In the second case, we form higher order cross-moments and tensorize them into a lower order tensor. This enables us to learn the network in the overcomplete regime k > d, when the Khatri-Rao product A_1 ⊙ A_1 ∈ R^{d²×k} is full column rank. We call this the overcomplete setting; it can handle up to k = O(d²). Similarly, we can extend to larger k by tensorizing higher order moments, at the expense of additional computational complexity.

We define the following quantity for the label function f̃(·):

ζ̃_f̃ := ∫_{R^d} f̃(x) dx.
Note that in the binary classification setting (ỹ ∈ {0, 1}), we have E[ỹ|x] := f̃(x) ∈ [0, 1], which is always nonnegative, and thus no square of f̃(x) is needed in the above quantity.

Let η denote the noise in the neural network model in (7), so that the output is ỹ = f̃(x) + η. Note that the noise η is not necessarily independent of x; for instance, in the classification setting with binary output ỹ ∈ {0, 1}, the noise is dependent on x.

Conditions for Theorem 3:

• Non-degeneracy of weight vectors: In the undercomplete setting (k ≤ d), the weight matrix A_1 ∈ R^{d×k} is full column rank and s_min(A_1) > ε, where s_min(·) denotes the minimum singular value, and ε > 0 is related to the target error in recovering the columns of A_1. In the overcomplete setting (k ≤ d²), the Khatri-Rao product A_1 ⊙ A_1 ∈ R^{d²×k} is full column rank⁴, and s_min(A_1 ⊙ A_1) > ε; see Remark 3 for the generalization.

• Conditions on the nonlinear activating function σ(·): The coefficients

λ_j := E[σ′′′(z_j)] · a_2(j),  λ̃_j := E[σ′′(z_j)] · a_2(j),  j ∈ [k],

in (20) and (40) are nonzero, where z := A_1^⊤ x + b_1 is the input to the nonlinear operator σ(·). In addition, σ(·) satisfies the Lipschitz property⁵ with constant L, i.e., |σ(u) − σ(u′)| ≤ L · |u − u′| for u, u′ ∈ R. Suppose also that the nonlinear activating function satisfies σ(z) = 1 − σ(−z). Many popular activating functions, such as the step function, the sigmoid function and the tanh function, satisfy this last property.

• Subgaussian noise: There exists a finite σ_noise ≥ 0 such that, almost surely,

E_η[exp(αη) | x] ≤ exp(α² σ²_noise / 2), ∀α ∈ R,

where η denotes the noise in the output ỹ.
• Bounded statistical leverage: There exists a finite ρ_λ ≥ 1 such that, almost surely,

√k / √((inf_j{λ_j} + λ) · k_{1,λ}) ≤ ρ_λ,

where k_{1,λ} denotes the effective dimension of the hidden layer h := σ(A_1^⊤ x + b_1), defined as k_{1,λ} := Σ_{j∈[k]} λ_j / (λ_j + λ). Here, the λ_j's denote the (positive) eigenvalues of the hidden layer covariance matrix Σ_h, and λ is the regularization parameter of the ridge regression.

We now elaborate on these conditions. The non-degeneracy of the weight vectors is required for the tensor decomposition analysis in the estimation of A_1. In the undercomplete setting, the algorithm first orthogonalizes (through a whitening procedure) the tensor given in (9), and then decomposes it through the tensor power iteration. Note that the convergence of the power iteration for orthogonal tensor decomposition is guaranteed (Zhang and Golub, 2001; Anandkumar et al., 2014b). For the orthogonalization procedure to work, we need the tensor components (the columns of matrix A_1) to be linearly independent. In the overcomplete setting, the algorithm performs the same steps with an additional tensorizing procedure; see Appendix B for details. In this case, a higher order tensor is given to the algorithm, and it is first tensorized before performing the same steps as in the undercomplete setting. Thus, the same conditions are now imposed on A_1 ⊙ A_1.

[⁴ It is shown in Bhaskara et al. (2013) that this condition is satisfied under smoothed analysis.]

[⁵ If the step function σ(u) = 1{u>0}(u) is used as the activating function, the Lipschitz property does not hold because of the discontinuity at u = 0. But we can assume the Lipschitz property holds on the continuous parts, i.e., when u, u′ > 0 or u, u′ < 0. We then argue that the input to the step function 1{u>0}(u) is w.h.p. in an interval where the Lipschitz property holds.]
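The undercomplete pipeline (whiten with a second-order moment, run the tensor power iteration, un-whiten) can be sketched end to end on exact population moments. The tiny dimensions and coefficient values below are hypothetical, and we use exact moments rather than the paper's empirical estimates:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 4, 2

# ground-truth (non-orthogonal) unit-norm components and nonzero coefficients
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)
lam = np.array([1.0, 1.5])        # third-order coefficients  lambda_j
lam2 = np.array([2.0, 1.0])       # second-order coefficients lambda~_j

# population moments: the second-order analogue of (9), and (9) itself
M2 = (A * lam2) @ A.T
T = np.einsum('j,aj,bj,cj->abc', lam, A, A, A)

# whitening: W such that W^T M2 W = I_k (top-k eigenpairs of M2)
eigval, eigvec = np.linalg.eigh(M2)
U, D = eigvec[:, -k:], eigval[-k:]
W = U / np.sqrt(D)

# the whitened tensor has an orthogonal rank-k decomposition
Tw = np.einsum('abc,ai,bj,cl->ijl', T, W, W, W)

# tensor power iteration:  u <- Tw(I, u, u) / ||Tw(I, u, u)||
u = rng.standard_normal(k)
u /= np.linalg.norm(u)
for _ in range(50):
    u = np.einsum('ijl,j,l->i', Tw, u, u)
    u /= np.linalg.norm(u)

# un-whiten and compare with the true columns of A (up to sign)
a_hat = np.linalg.pinv(W.T) @ u
a_hat /= np.linalg.norm(a_hat)
overlaps = np.abs(A.T @ a_hat)
assert overlaps.max() > 0.99
```

In the full algorithm the remaining components are extracted by deflation or simultaneous iteration with R random initializations, and the moments are replaced by their empirical versions T̂ and M̂_2.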
In addition to the non-degeneracy condition on the weight matrix A_1, the condition that the coefficients λ_j are nonzero is also required, to ensure that the corresponding rank-1 components in (9) do not vanish, so that the tensor decomposition algorithm can recover them. Similarly, the coefficients λ̃_j should also be nonzero, to enable us to use the second order moment M̃_2 in (39) in the whitening step of the tensor decomposition algorithm. If one of these coefficients vanishes, we use the other option to perform the whitening; see Remark 5 and Procedure 5 for details. Note that the amount of non-linearity of σ(·) affects the magnitude of the coefficients.

It is also worth mentioning that although we use the third derivative notation σ′′′(·) in characterizing the coefficients λ_j (and similarly σ′′(·) in λ̃_j), we do not need differentiability of the non-linear function σ(·) at all points. In particular, when the input x is a continuous variable, we use the Dirac delta function δ(·) as the derivative at points of discontinuity; for instance, for the derivative of the step function 1{x>0}(x), we have (d/dx) 1{x>0}(x) = δ(x). Thus, in general, we only need the expectations E[σ′′′(z_j)] and E[σ′′(z_j)] to exist for such functions and the corresponding higher order derivatives.

We impose the Lipschitz condition on the non-linear activating function to limit the error propagated to the hidden layer when the first layer parameters are estimated by the tensor and Fourier methods. The condition σ(z) = 1 − σ(−z) is assumed to tackle the sign issue in recovering the columns of A_1; see Remark 6 for the details.
The subgaussian noise and bounded statistical leverage conditions are standard conditions required for ridge regression, which is used for estimating the parameters of the second layer of the neural network. Both parameters σ_noise and ρ_λ affect the sample complexity in the final guarantees.

Imposing additional bounds on the parameters of the neural network is useful for learning these parameters with computationally efficient algorithms, since it limits the search space for training them. In particular, for the Fourier analysis, we assume the following conditions. Suppose the columns of the weight matrix A_1 are normalized, i.e., ‖(A_1)_j‖ = 1, j ∈ [k], and the entries of the first layer bias vector b_1 are bounded as |b_1(l)| ≤ 1, l ∈ [k]. Note that the normalization condition on the columns of A_1 is also needed for identifiability of the parameters. For instance, if the non-linear operator is the step function σ(z) = 1{z>0}(z), then the matrix A_1 is only identifiable up to the norm of its columns, and thus such a normalization condition is required for identifiability. The estimation of the entries of the bias vector b_1 is obtained from the phase of a complex number through Fourier analysis; see Procedure 2 for details. Since there is ambiguity in the phase of a complex number⁶, we impose the boundedness assumption on the entries of b_1 to avoid this ambiguity.

Let p(x) satisfy some mild regularity conditions on the boundaries of the support of p(x). In particular, all the entries of the (matrix-valued) functions

f̃(x) · ∇^{(2)} p(x), ∇f̃(x) · ∇p(x)^⊤, ∇^{(2)} f̃(x) · p(x) (11)

should go to zero on the boundaries of the support of p(x). These regularity conditions are required for the properties of the score function to hold; see Janzamin et al. (2014) for more details.
In addition to the above main conditions, we also need some mild conditions which are not crucial for the recovery guarantees and are mostly assumed to simplify the presentation of the main results; they can be further relaxed. Suppose the input x is bounded, i.e., x ∈ B_r, where B_r := {x : ‖x‖ ≤ r}. Assume the input probability density function satisfies p(x) ≥ ψ for some ψ > 0 and all x ∈ B_r. The regularity conditions in (11) might seem contradictory with the lower bound condition p(x) ≥ ψ, but there is an easy fix. The lower bound on p(x) is required for the analysis of the Fourier part of the algorithm. We can have a continuous p(x), while in the Fourier part we only use samples x such that p(x) ≥ ψ, and ignore the rest. This only introduces a probability factor Pr[x : p(x) ≥ ψ] in the analysis.

[⁶ A complex number does not change if an integer multiple of 2π is added to its phase.]

Settings of the algorithm in Theorem 3:

• Number of iterations in Algorithm 7: N = Θ(log(1/ε)).

• Number of initializations in Procedure 8: R ≥ poly(k).

• Parameter ε̃_1 = Õ(1/√n) in Procedure 2, where n is the number of samples.

• We exploit the empirical second order moment M̂_2 := (1/n) Σ_{i∈[n]} y_i · S_2(x_i) in the whitening Procedure 5, which is the first option stated in the procedure. See Remark 5 for a discussion of the other option.

Theorem 3 (NN-LIFT guarantees: estimation bound in the realizable setting). Assume the above settings and conditions hold. For ε > 0, suppose the number of samples n satisfies (up to log factors)

n ≥ Õ( (k/ε²) · E[‖M_3(x) M_3^⊤(x)‖] · poly( ỹ_max, E[‖S_2(x) S_2^⊤(x)‖] / E[‖M_3(x) M_3^⊤(x)‖], ζ̃_f̃ / ψ, λ̃_max / λ̃_min, 1/λ_min, s_max(A_1)/s_min(A_1), |Ω_l|, L, ‖a_2‖/(a_2)_min, |b_2|, σ_noise, ρ_λ ) ). (12)

See (32), (55), (57) and (59) for the complete form of the sample complexity. Here, M_3(x) ∈ R^{d×d²} denotes the matricization of the score function tensor S_3(x) ∈ R^{d×d×d}; see (1) for the definition of matricization. Furthermore, λ_min := min_{j∈[k]} |λ_j|, λ̃_min := min_{j∈[k]} |λ̃_j|, λ̃_max := max_{j∈[k]} |λ̃_j|, (a_2)_min := min_{j∈[k]} |a_2(j)|, and ỹ_max is such that |f̃(x)| ≤ ỹ_max for x ∈ B_r. Then the function estimate f̂(x) := ⟨â_2, σ(Â_1^⊤ x + b̂_1)⟩ + b̂_2, using the estimated parameters Â_1, b̂_1, â_2, b̂_2 (the output of NN-LIFT, Algorithm 1), satisfies the estimation error bound

E_x[|f̂(x) − f̃(x)|²] ≤ Õ(ε²).

See Section 7 and Appendix C for the proof of the theorem. Thus, we estimate the neural network with polynomial time and sample complexity. This is one of the first results to provide a guaranteed method for training neural networks with efficient computational and statistical complexity. Note that although the sample complexity in Barron (1993) is smaller, namely n ≥ Õ(kd/ε²), the algorithm proposed there is not computationally efficient.

Remark 2 (Sample complexity for Gaussian input). If the input x is Gaussian, x ~ N(0, I_d), then E[‖M_3(x) M_3^⊤(x)‖] = Õ(d³) and E[‖S_2(x) S_2^⊤(x)‖] = Õ(d²), and the above sample complexity simplifies accordingly.

Remark 3 (Higher order tensorization). We stated that by tensorizing higher order tensors into lower order ones, we can estimate overcomplete models where the hidden layer dimension k is larger than the input dimension d. We can generalize this idea such that m modes of the higher order tensor are tensorized into a single mode in the resulting lower order tensor.
This enables us to estimate models up to $k = O(d^m)$, assuming the matrix $A_1 \odot \cdots \odot A_1$ ($m$ Khatri-Rao products) is full column rank. This comes at the cost of higher computational complexity.

Remark 4 (Effect of erroneous estimation of $p(x)$). The input probability density function $p(x)$ is directly used in the Fourier part of the algorithm, and also indirectly used in the tensor decomposition part to compute the score function $S_3(x)$; see (8). In the above analysis, to simplify the presentation, we assume we know these functions exactly, and thus no additional error is introduced by estimating them. It is straightforward to incorporate the corresponding input-density estimation errors into the final bound.

Remark 5 (Alternative whitening procedure). In whitening Procedure 5, two options are provided for constructing the second order moment $M_2$. In the above analysis, we used the first option, which exploits the second order score function. If any coefficient $\tilde{\lambda}_j$, $j \in [k]$, in (39) vanishes, we cannot use the second order score function in the whitening procedure, and we use the other option for whitening; see Procedure 5 for the details.

5 Risk Bound in the Non-realizable Setting

In this section, we provide the risk bound for training the neural network with respect to an arbitrary target function; see Section 2.2 for the definition of the risk. In order to provide the risk bound with respect to an arbitrary target function, we also need to bound the approximation error in addition to the estimation error. For an arbitrary function $f(x)$, we need to find a neural network whose error in approximating the function can be bounded. We then combine this with the estimation error in training that neural network. This yields the final risk bound.
The approximation problem is about finding a neural network that approximates an arbitrary function $f(x)$ with bounded error. Thus, this is different from the realizable setting, where there is a fixed neural network and we only analyze its estimation. Barron (1993) provides an approximation bound for the two-layer neural network, and we exploit that here. His result is based on the Fourier properties of the function $f(x)$. Recall from (5) the definition of the Fourier transform of $f(x)$, denoted by $F(\omega)$, where $\omega$ is called the frequency variable. Define the first absolute moment of the Fourier magnitude distribution as
$$
C_f := \int_{\mathbb{R}^d} \|\omega\|_2 \cdot |F(\omega)| \, d\omega. \tag{13}
$$
Barron (1993) analyzes the approximation properties of
$$
\tilde{f}(x) = \sum_{j \in [k]} a_2(j)\, \sigma\!\left( \langle (A_1)_j, x \rangle + b_1(j) \right), \qquad \|(A_1)_j\| = 1,\ |b_1(j)| \le 1,\ |a_2(j)| \le 2C_f,\ j \in [k], \tag{14}
$$
where the columns of the weight matrix $A_1$ are the normalized versions of random frequencies drawn from the Fourier magnitude distribution $|F(\omega)|$ weighted by the norm of the frequency vector. More precisely,
$$
\omega_j \overset{\text{i.i.d.}}{\sim} \frac{\|\omega\|}{C_f}\, |F(\omega)|, \qquad (A_1)_j = \frac{\omega_j}{\|\omega_j\|}, \quad j \in [k]. \tag{15}
$$
See Section 7.2.1 for a detailed discussion of this connection between the columns of the weight matrix $A_1$ and the random frequency draws from the Fourier magnitude distribution, and how it is argued in the proof of the approximation bound. The other parameters $a_2, b_1$ also need to be found. He then shows the following approximation bound for (14).

Theorem 4 (Approximation bound, Theorem 3 of Barron (1993)). For a function $f(x)$ with bounded $C_f$, there exists an approximation $\tilde{f}(x)$ in the form of (14) that satisfies the approximation bound
$$
\mathbb{E}_x\!\left[ |f(x) - \tilde{f}(x)|^2 \right] \le O(r^2 C_f^2) \cdot \left( \frac{1}{\sqrt{k}} + \delta_1 \right)^2,
$$
where $f(x) = f(x) - f(0)$.
Here, for $\tau > 0$,
$$
\delta_\tau := \inf_{0 < \xi \le 1/2} \left\{ 2\xi + \sup_{|z| \ge \xi} \left| \sigma(\tau z) - 1_{\{z > 0\}}(z) \right| \right\} \tag{16}
$$
is a distance between the unit step function $1_{\{z>0\}}(z)$ and the scaled sigmoidal function $\sigma(\tau z)$.

See Barron (1993) for the complete proof of the above theorem. For completeness, we have also reviewed the main ideas of this proof in Section 7.2. We now provide the formal statement of our risk bound.

Conditions for Theorem 5:

- The nonlinear activating function $\sigma(\cdot)$ is an arbitrary sigmoidal function satisfying the aforementioned Lipschitz condition. Note that a sigmoidal function is a bounded measurable function on the real line for which $\sigma(z) \to 1$ as $z \to \infty$ and $\sigma(z) \to 0$ as $z \to -\infty$.
- For $\epsilon > 0$, suppose the number of samples $n$ satisfies (up to log factors)
$$
n \ge \tilde{O}\!\left( \frac{k}{\epsilon^2} \cdot \mathbb{E}\!\left[ \left\| M_3(x) M_3^\top(x) \right\| \right] \cdot \operatorname{poly}\!\left( y_{\max},\ \frac{\mathbb{E}\!\left[\left\|S_2(x) S_2^\top(x)\right\|\right]}{\mathbb{E}\!\left[\left\|M_3(x) M_3^\top(x)\right\|\right]},\ \frac{\zeta_f}{\psi},\ \frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}},\ \frac{1}{\lambda_{\min}},\ \frac{s_{\max}(A_1)}{s_{\min}(A_1)},\ |\Omega_l|,\ L,\ \frac{\|a_2\|}{(a_2)_{\min}},\ |b_2|,\ \sigma_{\text{noise}},\ \rho_\lambda \right) \right), \tag{17}
$$
where $\zeta_f := \int_{\mathbb{R}^d} f(x)^2 \, dx$; notice the difference with $\tilde{\zeta}_{\tilde{f}}$. Note that this is the same sample complexity as in Theorem 3 with $\tilde{y}_{\max}$ substituted with $y_{\max}$ and $\tilde{\zeta}_{\tilde{f}}$ substituted with $\zeta_f$.
- The target function $f(x)$ is bounded, and for $\epsilon > 0$, it has bounded $C_f$ as
$$
C_f \le \tilde{O}\!\left( \left( \frac{1}{\sqrt{k}} + \delta_1 \right)^{-1} \!\cdot\! \left( \frac{1}{\sqrt{k}} + \epsilon \right) \!\cdot\! \frac{1}{r \sqrt{\mathbb{E}\left[\|S_3(x)\|^2\right]}} \cdot \operatorname{poly}\!\left( \frac{1}{r},\ \frac{\mathbb{E}\left[\|S_3(x)\|^2\right]}{\mathbb{E}\left[\|S_2(x)\|^2\right]},\ \psi,\ \frac{\tilde{\lambda}_{\min}}{\tilde{\lambda}_{\max}},\ \lambda_{\min},\ \frac{s_{\min}(A_1)}{s_{\max}(A_1)},\ \frac{1}{|\Omega_l|},\ \frac{1}{L},\ \frac{(a_2)_{\min}}{\|a_2\|},\ \frac{1}{|b_2|},\ \frac{1}{\sigma_{\text{noise}}},\ \frac{1}{\rho_\lambda} \right) \right). \tag{18}
$$
See (60) and (62) for the complete form of the bound on $C_f$. For Gaussian input $x \sim \mathcal{N}(0, I_d)$, we have $\sqrt{\mathbb{E}[\|S_3(x)\|^2]} = \tilde{O}(d^{1.5})$ and $r = \tilde{O}(\sqrt{d})$. See Corollary 1 for examples of functions that satisfy this bound, and thus can be learned by the proposed method.
- The coefficients $\lambda_j := \mathbb{E}[\sigma'''(z_j)] \cdot a_2(j)$ and $\tilde{\lambda}_j := \mathbb{E}[\sigma''(z_j)] \cdot a_2(j)$, $j \in [k]$, in (20) and (40) are non-zero.
- The $k$ random i.i.d. draws of frequencies in Equation (15) are linearly independent. Note that the draws are from the Fourier magnitude distribution $\|\omega\| \cdot |F(\omega)|$, normalized to be a probability distribution as in (15). For more discussion of this condition, see Section 7.2.1 and the earlier explanations in this section. In the overcomplete regime ($k > d$), the linear independence property needs to hold for appropriate tensorizations of the frequency draws.

The above requirements on the number of samples $n$ and the parameter $C_f$ depend on the parameters of the neural network $A_1, a_2, b_1$ and $b_2$. Note that there is also a dependence on these parameters through the coefficients $\lambda_j$ and $\tilde{\lambda}_j$. Since this is the non-realizable setting, these neural network parameters correspond to the neural networks that satisfy the approximation bound proposed in Theorem 4 and are generated via random draws from the frequency spectrum of the function $f(x)$.

The proposed bound on $C_f$ in (18) is stricter when the number of hidden units $k$ increases. This might seem counter-intuitive, since the approximation result in Theorem 4 suggests that increasing $k$ leads to smaller approximation error. But note that the approximation result in Theorem 4 does not consider efficient training of the neural network, while the result in Theorem 5 also deals with efficient estimation. This imposes an additional constraint on the parameter $C_f$: when the number of neurons increases, learning the network weights becomes more challenging for the tensor method to resolve.

Theorem 5 (NN-LIFT guarantees: risk bound). Suppose the above conditions hold.
Then the target function $f$ is approximated by the neural network $\hat{f}$ learned by NN-LIFT in Algorithm 1, satisfying w.h.p.
$$
\mathbb{E}_x\!\left[ |f(x) - \hat{f}(x)|^2 \right] \le O(r^2 C_f^2) \cdot \left( \frac{1}{\sqrt{k}} + \delta_1 \right)^2 + O(\epsilon^2),
$$
where $\delta_\tau$ is defined in (16). Recall $x \in \mathcal{B}_r$, where $\mathcal{B}_r := \{x : \|x\| \le r\}$.

The theorem is mainly proved by combining the estimation bound guarantees in Theorem 3 and the approximation bound results for neural networks provided in Theorem 4. But note that the approximation bound in Theorem 4 holds for a specific class of neural networks which are not generally recovered by the NN-LIFT algorithm. In addition, the estimation guarantees in Theorem 3 are for the realizable setting, where the observations are the outputs of a fixed neural network, while in Theorem 5 we observe samples of an arbitrary function $f(x)$. Thus, the approximation analysis in Theorem 4 cannot be directly applied to Theorem 3. For this, we need additional assumptions to ensure the NN-LIFT algorithm recovers a neural network which is close to one of the neural networks that satisfy the approximation bound in Theorem 4. Therefore, we impose the bound on the quantity $C_f$, and the full column rank assumption proposed in Theorem 4. See Appendix D for the complete proof of Theorem 5.

The above risk bound includes two terms. The first term $O(r^2 C_f^2) \cdot \left( \frac{1}{\sqrt{k}} + \delta_1 \right)^2$ represents the approximation error: how well the arbitrary function $f(x)$ with quantity $C_f$ can be approximated by the neural network whose weights are drawn from the Fourier magnitude distribution; see Theorem 4 for the formal statement. From the definition of $C_f$ in (13), this bound is weaker when the Fourier spectrum of the target $f(x)$ has more energy at higher frequencies.
This makes intuitive sense, since it should be easier to approximate a function which is smoother and has fewer fluctuations. The second term $O(\epsilon^2)$ is the estimation error of the NN-LIFT algorithm, which is analyzed in Theorem 3. The polynomial factors for sample complexity in our estimation error are slightly worse than the bound provided in Barron (1994), but note that we provide an estimation method which is both computationally and statistically efficient, while the method in Barron (1994) is not computationally efficient. Thus, for the first time, we have a computationally efficient method with guaranteed risk bounds for training neural networks.

Discussion on $\delta_\tau$ in the approximation bound: The approximation bound involves a term $\delta_\tau$ which is a constant and does not shrink with increasing neuron number $k$. Recall that $\delta_\tau$ measures the distance between the unit step function $1_{\{z>0\}}(z)$ and the scaled sigmoidal function $\sigma(\tau z)$ (which is used in the neural network specified in (7)). We now provide the following two observations. First, the above risk bound is only provided for the case $\tau = 1$. We can generalize this result by imposing a different constraint on the norm of the columns of $A_1$ in (14). In general, if we impose $\|(A_1)_j\| = \tau$, $j \in [k]$, for some $\tau > 0$, then we have the approximation bound $O(r^2 C_f^2) \cdot \left( \frac{1}{\sqrt{k}} + \delta_\tau \right)^2$ (note that this change also requires some straightforward modifications in the algorithm). Since $\delta_\tau \to 0$ as $\tau \to \infty$ (the scaled sigmoidal function $\sigma(\tau z)$ converges to the unit step function), this constant approximation error vanishes. Second, if the sigmoidal function is the unit step function itself, $\sigma(z) = 1_{\{z>0\}}(z)$, then $\delta_\tau = 0$ for all $\tau > 0$, and hence there is no such constant approximation error.

6 Discussions and Extensions

In this section, we provide additional discussions.
We first propose a toy example contrasting the hardness of the optimization problems underlying backpropagation and tensor decomposition. We then discuss the generalization of the learning guarantees to higher dimensional output, and also the continuous output case. We then discuss an alternative approach for estimating the low-dimensional parameters of the model.

6.1 Contrasting the loss surface of backpropagation with tensor decomposition

We discussed that the computational hardness of training a neural network is due to the non-convexity of the loss function, and thus popular local search methods such as backpropagation can get stuck in spurious local optima. We now provide a toy example highlighting this issue, and contrast it with the tensor decomposition approach. We consider a simple binary classification task shown in Figure 2(a), where blue and magenta data points correspond to two different classes. It is clear that these two classes can be classified by a mixture of two linear classifiers, drawn as green solid lines in the figure. For this task, we consider a two-layer neural network with two hidden neurons. The loss surfaces for backpropagation and tensor decomposition are shown, in terms of the first-layer weights $A_1(1,1), A_1(2,1)$ of one neuron, in Figures 2(b) and 2(c), respectively.

[Figure 2: (a) Classification setup: two colors correspond to binary labels; a two-layer neural network with two hidden neurons is used. (b) Loss surface for backpropagation. (c) Loss surface for the tensor method.]
The loss surface in terms of the first layer weights of one neuron (i.e., the weights connecting the inputs to that neuron) is plotted while the other parameters are fixed. (b) The loss surface for the usual square loss objective has spurious local optima; in part (a), one of the spurious local optima is drawn as red dashed lines and the global optimum as green solid lines. (c) The loss surface for the tensor factorization objective is free of spurious local optima.

The surfaces are plotted in terms of the weight parameters from the inputs to the first neuron, i.e., the first column of matrix $A_1$, while the weight parameters to the second neuron are randomly drawn and then fixed. A stark contrast is observed between the optimization landscape of the tensor objective function and the usual square loss objective used for backpropagation: even for this very simple classification task, backpropagation suffers from spurious local optima (one set of them is drawn as red dashed lines), which is not the case for the tensor method, which is at least locally convex. This comparison highlights the algorithmic advantage of tensor decomposition over backpropagation in terms of the optimization being performed.

6.2 Extensions to cases beyond binary classification

We earlier limited ourselves to the case where the output of the neural network $\tilde{y} \in \{0, 1\}$ is binary. These results can be easily extended to more complicated cases such as higher dimensional output (multi-label and multi-class), and also continuous output (i.e., the regression setting). In the rest of this section, we discuss the necessary changes in the algorithm to adapt it to these cases. In the multi-dimensional case, the output label $\tilde{y}$ is a vector generated as
$$
\mathbb{E}[\tilde{y} \mid x] = A_2^\top \sigma(A_1^\top x + b_1) + b_2,
$$
where the output is either discrete (multi-label and multi-class) or continuous.
Recall that the algorithm includes three main parts: tensor decomposition, Fourier, and ridge regression components.

Tensor decomposition: For the tensor decomposition part, we first form the empirical version of $\tilde{T} = \mathbb{E}[\tilde{y} \otimes S_3(x)]$; note that $\otimes$ is used here (instead of the scalar product used earlier) since $\tilde{y}$ is no longer a scalar. By the properties of the score function, this tensor has the decomposition form
$$
\tilde{T} = \mathbb{E}[\tilde{y} \otimes S_3(x)] = \sum_{j \in [k]} \mathbb{E}\left[\sigma'''(z_j)\right] \cdot (A_2)_j \otimes (A_1)_j \otimes (A_1)_j \otimes (A_1)_j,
$$
where $(A_2)_j$ denotes the $j$-th row of matrix $A_2$. This is proved similarly to Lemma 6. The tensor $\tilde{T}$ is a fourth order tensor, and we contract the first mode by multiplying it with a random vector $\theta$, leading to the same form as in (19):
$$
\tilde{T}(\theta, I, I, I) = \sum_{j \in [k]} \lambda_j \cdot (A_1)_j \otimes (A_1)_j \otimes (A_1)_j,
$$
with $\lambda_j$ changed to $\lambda_j = \mathbb{E}[\sigma'''(z_j)] \cdot \langle (A_2)_j, \theta \rangle$. Therefore, the same tensor decomposition guarantees in the binary case also hold here when the empirical version of $\tilde{T}(\theta, I, I, I)$ is the input to the algorithm.

Fourier method: Similar to the scalar case, we can use one of the entries of the output to estimate the entries of $b_1$. There is an additional difference in the continuous case. Suppose the output is generated as $\tilde{y} = \tilde{f}(x) + \eta$, where $\eta$ is a noise vector independent of the input $x$. In this case, the parameter $\tilde{\zeta}_{\tilde{f}}$ corresponding to the $l$-th entry of the output $\tilde{y}_l$ is changed to $\tilde{\zeta}_{\tilde{f}} := \int_{\mathbb{R}^d} \tilde{f}_l(x)^2 \, dx + \int_{\mathbb{R}} \eta_l^2 \, dt$.

Ridge regression: The ridge regression method and analysis can be immediately generalized to non-scalar output by applying the method independently to the different entries of the output vector, recovering the different columns of matrix $A_2$ and the different entries of vector $b_2$.
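The mode contraction above can be checked numerically. The following numpy sketch uses synthetic $A_1$, $A_2$, and a `coef` vector standing in for the unknown $\mathbb{E}[\sigma'''(z_j)]$ (all names and dimensions are ours, for illustration only): it builds the fourth order tensor, contracts its first mode with a random $\theta$, and verifies that the result matches the rank-$k$ form in (19) with $\lambda_j = \mathbb{E}[\sigma'''(z_j)] \cdot \langle (A_2)_j, \theta \rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, d_out = 5, 3, 2
A1 = rng.standard_normal((d, k))          # columns (A1)_j
A2 = rng.standard_normal((k, d_out))      # rows (A2)_j
coef = rng.standard_normal(k)             # stands in for E[sigma'''(z_j)]

# 4th-order tensor  T~ = sum_j coef_j (A2)_j ⊗ (A1)_j ⊗ (A1)_j ⊗ (A1)_j
T4 = np.einsum('j,jm,aj,bj,cj->mabc', coef, A2, A1, A1, A1)

# contract the first mode with a random theta: T~(theta, I, I, I)
theta = rng.standard_normal(d_out)
T3 = np.einsum('mabc,m->abc', T4, theta)

# same form as (19) with lambda_j = coef_j * <(A2)_j, theta>
lam = coef * (A2 @ theta)
T3_direct = np.einsum('j,aj,bj,cj->abc', lam, A1, A1, A1)
assert np.allclose(T3, T3_direct)
```

The check passes because contraction is linear in the first mode, so it simply rescales each rank-1 term by $\langle (A_2)_j, \theta \rangle$.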
6.3 An alternative for estimating low-dimensional parameters

Once we have an estimate of the first layer weights $A_1$, we can greedily (i.e., incrementally) add neurons with the weight vectors $(A_1)_j$ for $j \in [k]$, choose the bias $b_1(j)$ through grid search, and learn the contribution $a_2(j)$ of each neuron to the final output. This is along the lines of the method proposed in Barron (1993), with one crucial difference: in our case, the first layer weights $A_1$ are already estimated by the tensor method. Barron (1993) proposes optimizing for each weight vector $(A_1)_j$ in $d$-dimensional space, whose computational complexity can scale exponentially in $d$ in the worst case. But in our setup, since we have already estimated the high-dimensional parameters (i.e., the columns of $A_1$), we only need to estimate a few low-dimensional parameters. For the new hidden unit indexed by $j$, these parameters are the bias from the input layer to the neuron (i.e., $b_1(j)$) and the weight from the neuron to the output (i.e., $a_2(j)$). This makes the approach computationally tractable, and we can even use brute-force or exhaustive search to find the best parameters on a finite set and get guarantees akin to (Barron, 1993).

7 Proof Sketch

In this section, we provide the key ideas for proving the main results in Theorems 3 and 4.

7.1 Estimation bound

The estimation bound is proposed in Theorem 3, and the complete proof is provided in Appendix C. Recall that the NN-LIFT algorithm includes a tensor decomposition part for estimating $A_1$, a Fourier technique for estimating $b_1$, and a linear regression for estimating $a_2, b_2$. The application of linear regression in the last layer is immediately clear.
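To see why the last-layer step is "immediately clear": once $A_1, b_1$ are fixed, the output is linear in $(a_2, b_2)$, so ridge regression on the hidden activations recovers them. A hedged numpy sketch on synthetic data (the architecture sizes, noise level, and regularization constant are made up for illustration, not taken from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n, d, k = 3000, 4, 3
A1 = rng.standard_normal((d, k))
b1 = rng.standard_normal(k)
a2 = np.array([1.0, -0.5, 2.0])
b2 = 0.3

X = rng.standard_normal((n, d))
y = sigmoid(X @ A1 + b1) @ a2 + b2 + 0.01 * rng.standard_normal(n)

# With A1, b1 already recovered, the model is linear in (a2, b2):
# append an intercept column and solve the ridge normal equations.
H = np.column_stack([sigmoid(X @ A1 + b1), np.ones(n)])
reg = 1e-3
w = np.linalg.solve(H.T @ H + reg * np.eye(k + 1), H.T @ y)
a2_hat, b2_hat = w[:k], w[k]
```

With a small noise level, `a2_hat` and `b2_hat` land close to the generating `a2` and `b2`.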
In this section, we propose two main lemmas which clarify why the other methods are useful for estimating the unknown parameters $A_1, b_1$ in the realizable setting, where the label $\tilde{y}$ is generated by a neural network with the given architecture. In the following lemma, we show how the cross-moment between the label and the score function, $\mathbb{E}[\tilde{y} \cdot S_3(x)]$, leads to a tensor decomposition form for estimating the weight matrix $A_1$.

Lemma 6. For the two-layer neural network specified in (7), we have
$$
\mathbb{E}[\tilde{y} \cdot S_3(x)] = \sum_{j \in [k]} \lambda_j \cdot (A_1)_j \otimes (A_1)_j \otimes (A_1)_j, \tag{19}
$$
where $(A_1)_j \in \mathbb{R}^d$ denotes the $j$-th column of $A_1$, and
$$
\lambda_j = \mathbb{E}\left[\sigma'''(z_j)\right] \cdot a_2(j), \tag{20}
$$
for the vector $z := A_1^\top x + b_1$ as the input to the nonlinear operator $\sigma(\cdot)$.

This is proved via the main property of score functions as yielding differential operators: for the label function $f(x) := \mathbb{E}[y \mid x]$, we have $\mathbb{E}[y \cdot S_3(x)] = \mathbb{E}[\nabla_x^{(3)} f(x)]$ (Janzamin et al., 2014); see Section C.1 for a complete proof of the lemma. This lemma shows that by decomposing the cross-moment tensor $\mathbb{E}[\tilde{y} \cdot S_3(x)]$, we can recover the columns of $A_1$.

We also exploit the phase of the complex number $v$ to estimate the bias vector $b_1$; see Procedure 2. The following lemma clarifies this. The perturbation analysis is provided in the appendix.

Lemma 7. Let
$$
\tilde{v} := \frac{1}{n} \sum_{i \in [n]} \frac{\tilde{y}_i}{p(x_i)} e^{-j \langle \omega_i, x_i \rangle}. \tag{21}
$$
Notice that this is a realization of $v$ in Procedure 2 where the output corresponds to a neural network $\tilde{y}$. If the $\omega_i$'s are uniformly i.i.d. drawn from the set $\Omega_l$, then $\tilde{v}$ has mean (computed over $x$, $\tilde{y}$ and $\omega$)
$$
\mathbb{E}[\tilde{v}] = \frac{1}{|\Omega_l|} \Sigma\!\left(\frac{1}{2}\right) a_2(l)\, e^{j \pi b_1(l)}, \tag{22}
$$
where $|\Omega_l|$ denotes the surface area of the $(d-1)$-dimensional manifold $\Omega_l$, and $\Sigma(\cdot)$ denotes the Fourier transform of $\sigma(\cdot)$.

This lemma is proved in Appendix C.2.
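For intuition about forming the cross-moment in Lemma 6: when the input is Gaussian, $x \sim \mathcal{N}(0, I_d)$, the third order score function is the third Hermite tensor, and the empirical cross-moment is a plain average over samples. The sketch below is our illustration, not part of NN-LIFT (the helper name `score3_gaussian` and the placeholder labels are ours; in the algorithm, $\tilde{y}$ would come from the network):

```python
import numpy as np

def score3_gaussian(x):
    """Third order score for x ~ N(0, I): the third Hermite tensor,
    H3(x)_{abc} = x_a x_b x_c - x_a d_{bc} - x_b d_{ac} - x_c d_{ab}."""
    d = x.shape[0]
    I = np.eye(d)
    return (np.einsum('a,b,c->abc', x, x, x)
            - np.einsum('a,bc->abc', x, I)
            - np.einsum('b,ac->abc', x, I)
            - np.einsum('c,ab->abc', x, I))

rng = np.random.default_rng(0)
d, n = 4, 500
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)                # placeholder labels

# empirical cross-moment  (1/n) sum_i y_i * S3(x_i), as in Lemma 6
T_hat = np.mean([yi * score3_gaussian(xi) for xi, yi in zip(X, y)], axis=0)
```

The score tensor is symmetric by construction, so the empirical cross-moment is a symmetric $d \times d \times d$ tensor, ready for the decomposition step.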
7.2 Approximation bound

We exploit the approximation bound argued in Barron (1993) and provided in Theorem 4. We first discuss his main result, which argues an approximation bound $O(r^2 C_f^2 / k)$ for a function $f(x)$ with bounded parameter $C_f$; see (13) for the definition of $C_f$. Note that this corresponds to the first term in the approximation error proposed in Theorem 4. For this result, Barron (1993) does not impose any bound on the first-layer parameters $A_1$ and $b_1$. He then provides a refinement of this result in which the parameters of the neural network are also bounded, as we do in (14). This leads to the additional term involving $\delta_\tau$ in the approximation error seen in Theorem 4. Note that bounding the parameters of the neural network is also useful for learning these parameters with computationally efficient algorithms, since it limits the search space for training them. We now provide the main ideas behind these bounds.

7.2.1 No bounds on the parameters of the neural network

We first provide the proof outline when there are no additional constraints on the parameters of the neural network; see the set $G$ defined in (24), and compare it with the form we use in (14), where there are additional bounds. In this case, Barron (1993) argues an approximation bound $O(r^2 C_f^2 / k)$, which is proved based on two main results. The first result says that if a function $f$ is in the closure of the convex hull of a set $G$ in a Hilbert space, then for every $k \ge 1$, there is an $f_k$ as a convex combination of $k$ points in $G$ such that
$$
\mathbb{E}\left[ |f - f_k|^2 \right] \le \frac{c'}{k}, \tag{23}
$$
for any constant $c'$ satisfying a lower bound related to the properties of the set $G$ and the function $f$; see Lemma 1 in Barron (1993) for the precise statement and proof of this result.
The second part of the proof argues that an arbitrary function $f \in \Gamma$ (where $\Gamma$ denotes the set of functions with bounded $C_f$) is in the closure of the convex hull of the sigmoidal functions
$$
G := \left\{ \gamma\, \sigma\!\left( \langle \alpha, x \rangle + \beta \right) : \alpha \in \mathbb{R}^d,\ \beta \in \mathbb{R},\ |\gamma| \le 2C \right\}. \tag{24}
$$
Barron (1993) proves this result by arguing the following chain of inclusions:
$$
\Gamma \subset \operatorname{cl} G_{\cos} \subset \operatorname{cl} G_{\text{step}} \subset \operatorname{cl} G,
$$
where $\operatorname{cl} G$ denotes the closure of the set $G$, and the sets $G_{\cos}$ and $G_{\text{step}}$ respectively denote sets of certain sinusoidal and step functions. See Theorem 2 in Barron (1993) for the precise statement and proof of this result.

Random frequency draws from the Fourier magnitude distribution: Recall from Section 5 that the columns of the weight matrix $A_1$ are the normalized versions of random frequencies drawn from the Fourier magnitude distribution $\|\omega\| \cdot |F(\omega)|$; see Equation (15). This connection arises in the proof of the inclusion $\Gamma \subset \operatorname{cl} G_{\cos}$, which we recap here; see the proof of Lemma 2 in Barron (1993) for more details. By expanding the Fourier transform into magnitude and phase parts, $F(\omega) = e^{j\theta(\omega)} |F(\omega)|$, we have
$$
f(x) := f(x) - f(0) = \int g(x, \omega)\, \Lambda(d\omega), \tag{25}
$$
where
$$
\Lambda(\omega) := \|\omega\| \cdot |F(\omega)| / C_f \tag{26}
$$
is the normalized Fourier magnitude distribution (as a probability distribution) weighted by the norm of the frequency vector, and
$$
g(x, \omega) := \frac{C_f}{\|\omega\|} \left( \cos(\langle \omega, x \rangle + \theta(\omega)) - \cos(\theta(\omega)) \right).
$$
The integral in (25) is an infinite convex combination of functions in the class
$$
G_{\cos} := \left\{ \frac{\gamma}{\|\omega\|} \left( \cos(\langle \omega, x \rangle + \beta) - \cos(\beta) \right) : \omega \ne 0,\ |\gamma| \le C,\ \beta \in \mathbb{R} \right\}.
$$
Now if $\omega_1, \omega_2, \ldots, \omega_k$ is a random sample of $k$ points drawn independently from the Fourier magnitude distribution $\Lambda$, then by Fubini's theorem we have
$$
\mathbb{E} \int_{\mathcal{B}_r} \left( f(x) - \frac{1}{k} \sum_{j \in [k]} g(x, \omega_j) \right)^2 \mu(dx) \le \frac{C^2}{k},
$$
where $\mu(\cdot)$ is the probability measure of $x$. This shows the function $f$ is in the convex hull of $G_{\cos}$.
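To make the representation (25) concrete, consider a toy $f$ with a discrete Fourier spectrum, $f(x) = \sum_m c_m \cos(\langle \omega_m, x \rangle + \theta_m)$ with $c_m > 0$. Then $C_f = \sum_m \|\omega_m\| c_m$, the distribution $\Lambda$ puts mass $\|\omega_m\| c_m / C_f$ on $\omega_m$, and averaging $g(x, \omega)$ under $\Lambda$ recovers $f(x) - f(0)$ exactly. A sketch with made-up frequencies (all values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 4
omegas = rng.standard_normal((m, d))       # discrete spectrum support
c = np.abs(rng.standard_normal(m)) + 0.1   # positive magnitudes
phases = rng.uniform(0.0, 2.0 * np.pi, m)

def f(x):
    return np.sum(c * np.cos(omegas @ x + phases))

norms = np.linalg.norm(omegas, axis=1)
C_f = np.sum(norms * c)                    # first absolute moment, as in (13)
Lam = norms * c / C_f                      # the distribution Λ in (26)

def g(x, j):
    # g(x, ω_j) = (C_f/||ω_j||) (cos(<ω_j,x> + θ_j) − cos θ_j)
    return (C_f / norms[j]) * (np.cos(omegas[j] @ x + phases[j])
                               - np.cos(phases[j]))

x = rng.standard_normal(d)
f0 = f(np.zeros(d))
mixture = sum(Lam[j] * g(x, j) for j in range(m))
assert np.isclose(mixture, f(x) - f0)      # the convex combination in (25)
```

Sampling $k$ frequencies from `Lam` instead of enumerating all atoms gives a Monte Carlo estimate whose mean squared error obeys the $O(C^2/k)$ bound above.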
Note that the bound $\frac{C^2}{k}$ complies with the bound in (23).

7.2.2 Bounding the parameters of the neural network

Barron (1993) then imposes additional bounds on the weights of the first layer, considering the following class of sigmoidal functions:
$$
G_\tau := \left\{ \gamma\, \sigma\!\left( \tau (\langle \alpha, x \rangle + \beta) \right) : \|\alpha\| \le 1,\ |\beta| \le 1,\ |\gamma| \le 2C \right\}. \tag{27}
$$
Note that the approximation proposed in (14) is a convex combination of $k$ points in (27) with $\tau = 1$. Barron (1993) concludes Theorem 4 with the following lemma.

Lemma 8 (Lemma 5 in Barron (1993)). If $g$ is a function on $[-1, 1]$ with derivative bounded by a constant $C$, then for every $\tau > 0$, we have
$$
\inf_{g_\tau \in \operatorname{cl} G_\tau}\ \sup_{|z| \le \tau} |g(z) - g_\tau(z)| \le 2C\, \delta_\tau.
$$
(Note that the bounded-derivative condition does not rule out cases such as the step function as the sigmoidal function. This is because, as in the analysis of the main case with no bounds on the weights, we first argue that the function $f$ is in the closure of functions in $G_{\cos}$, which are univariate functions with bounded derivative.)

Finally, Theorem 4 is proved by applying the triangle inequality to the bounds argued in the above two cases.

8 Conclusion

We have proposed a novel algorithm based on tensor decomposition for training two-layer neural networks. This is a computationally efficient method with guaranteed risk bounds with respect to the target function, under polynomial sample complexity in the input and neuron dimensions. The tensor method is embarrassingly parallel and has a parallel computational complexity which is logarithmic in the input dimension, comparable with parallel stochastic backpropagation. There are a number of open problems to consider in the future. Extending this framework to multi-layer networks is of great interest. Exploring the score function framework to train other discriminative models is also interesting.

Acknowledgements

We thank Ben Recht for pointing us to Barron's work on approximation bounds for neural networks (Barron, 1993), and thank Arthur Gretton for discussions about estimation of the score function.
We also acknowledge fruitful discussions with Roi Weiss about the presentation of the proof of Theorem 5 on combining estimation and approximation bounds, and his detailed editorial comments on a preliminary version of the draft. We thank Peter Bartlett for detailed discussions and for pointing us to several classical results on neural networks. We are very thankful to Andrew Barron for detailed discussions and for encouraging us to explore alternative incremental training methods for estimating the remaining parameters after the tensor decomposition step; we have added this discussion to the paper. We thank Daniel Hsu for discussions on random design analysis of ridge regression. We thank Percy Liang for discussions about the score function. M. Janzamin is supported by NSF BIGDATA award FG16455. H. Sedghi is supported by NSF Career award FG15890. A. Anandkumar is supported in part by a Microsoft Faculty Fellowship, NSF Career award CCF-1254106, and ONR Award N00014-14-1-0665.

A Tensor Notation

In this section, we provide the additional tensor notation required for the analysis in the supplementary material.

Multilinear form: The multilinear form for a tensor $T \in \mathbb{R}^{q_1 \times q_2 \times q_3}$ is defined as follows. Consider matrices $M_r \in \mathbb{R}^{q_r \times p_r}$, $r \in \{1, 2, 3\}$. Then the tensor $T(M_1, M_2, M_3) \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ is defined as
$$
T(M_1, M_2, M_3) := \sum_{j_1 \in [q_1]} \sum_{j_2 \in [q_2]} \sum_{j_3 \in [q_3]} T_{j_1, j_2, j_3} \cdot M_1(j_1, :) \otimes M_2(j_2, :) \otimes M_3(j_3, :).
$$
(28)

As a simpler case, for vectors $u, v, w \in \mathbb{R}^d$, we have
$$
T(I, v, w) := \sum_{j, l \in [d]} v_j w_l\, T(:, j, l) \in \mathbb{R}^d, \tag{29}
$$
which is a multilinear combination of the tensor mode-1 fibers. (Compare with the matrix case, where for $M \in \mathbb{R}^{d \times d}$ we have $M(I, u) = Mu := \sum_{j \in [d]} u_j M(:, j)$.)

B Details of the Tensor Decomposition Algorithm

The goal of the tensor decomposition algorithm is to recover the rank-1 components of the tensor; refer to Equation (3) for the notion of tensor rank and its rank-1 components. We exploit the tensor decomposition algorithm proposed in (Anandkumar et al., 2014b,c). Figure 3 depicts the flow chart of this method, with the corresponding algorithms and procedures specified. Similarly, Algorithm 4 states the high-level steps of the tensor decomposition algorithm. The main step of the tensor decomposition method is the tensor power iteration, which is the generalization of the matrix power iteration to 3rd order tensors. The tensor power iteration is given by
$$
u \leftarrow \frac{T(I, u, u)}{\|T(I, u, u)\|},
$$
where $u \in \mathbb{R}^d$ and $T(I, u, u) := \sum_{j, l \in [d]} u_j u_l\, T(:, j, l) \in \mathbb{R}^d$ is a multilinear combination of tensor fibers. Note that tensor fibers are the vectors derived by fixing all indices of the tensor except one, e.g., $T(:, j, l)$ in the above expansion.

[Figure 3: Overview of the tensor decomposition algorithm for a third order tensor (without tensorization): input tensor $T = \sum_{i \in [k]} \lambda_i u_i^{\otimes 3}$ → Whitening procedure (Procedure 5) → SVD-based initialization (Procedure 8) → Tensor power method (Algorithm 7) → output $\{u_i\}_{i \in [k]}$.]

Algorithm 4 Tensor Decomposition Algorithm Setup
input: symmetric tensor $T$.
1: if Whitening then
2: Calculate $T = \text{Whiten}(T)$; see Procedure 5.
3: else if Tensorizing then
4: Tensorize the input tensor.
5: Calculate $T = \text{Whiten}(T)$; see Procedure 5.
6: end if
7: for $j = 1$ to $k$ do
8: $(v_j, \mu_j, T) = \text{tensor power decomposition}(T)$; see Algorithm 7.
9: end for
10: $(A_1)_j = \text{Un-whiten}(v_j)$, $j \in [k]$; see Procedure 6.
11: return $\{(A_1)_j\}_{j \in [k]}$.

The initialization for the different runs of the tensor power iteration is performed by the SVD-based technique proposed in Procedure 8. This helps to initialize the non-convex tensor power iteration with good initialization vectors. The whitening preprocessing is applied to orthogonalize the components of the input tensor. Note that convergence guarantees of the tensor power iteration for orthogonal tensor decomposition have been developed in the literature (Zhang and Golub, 2001; Anandkumar et al., 2014b). The tensorization step works as follows.

Tensorization: The tensorizing step is applied when we want to decompose overcomplete tensors, where the rank $k$ is larger than the dimension $d$. For instance, to handle rank up to $k = O(d^2)$, we first form the 6th order input tensor with decomposition
$$
T = \sum_{j \in [k]} \lambda_j\, a_j^{\otimes 6} \in \bigotimes^6 \mathbb{R}^d.
$$
Given $T$, we form the 3rd order tensor $\tilde{T} \in \bigotimes^3 \mathbb{R}^{d^2}$, which is the tensorization of $T$, such that
$$
\tilde{T}\bigl( i_2 + d(i_1 - 1),\ j_2 + d(j_1 - 1),\ l_2 + d(l_1 - 1) \bigr) := T(i_1, i_2, j_1, j_2, l_1, l_2). \tag{30}
$$

Procedure 5 Whitening
input: tensor $T \in \mathbb{R}^{d \times d \times d}$.
1: Construct the second order moment $M_2 \in \mathbb{R}^{d \times d}$ such that it has the same decomposition form as the target tensor $T$ (see Section C.1.1 for more discussion):
   - Option 1: construct it using the second order score function; see Equation (39).
   - Option 2: compute $M_2 := T(I, I, \theta) \in \mathbb{R}^{d \times d}$, where $\theta \sim \mathcal{N}(0, I_d)$ is a random standard Gaussian vector.
2: Compute the rank-$k$ SVD $M_2 = U \operatorname{Diag}(\gamma) U^\top$, where $U \in \mathbb{R}^{d \times k}$ and $\gamma \in \mathbb{R}^k$.
3: Compute the whitening matrix $W := U \operatorname{Diag}(\gamma^{-1/2}) \in \mathbb{R}^{d \times k}$.
4: return $T(W, W, W) \in \mathbb{R}^{k \times k \times k}$.
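A minimal numpy sketch of Procedure 5 on a synthetic tensor. To keep the example self-contained we form $M_2$ directly from the synthetic factors (mimicking an option-1-style moment with the same decomposition as $T$, which is not available in the real algorithm); after whitening, the components $\sqrt{\lambda_j}\, W^\top a_j$ are orthonormal, which is exactly what the power method needs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.standard_normal((d, k))            # columns a_j
lam = np.abs(rng.standard_normal(k)) + 0.5

# synthetic third order tensor  T = sum_j lam_j a_j^{⊗3}
T = np.einsum('j,aj,bj,cj->abc', lam, A, A, A)

# second order moment with the same decomposition: M2 = A Diag(lam) A^T
M2 = A @ np.diag(lam) @ A.T
U, g, _ = np.linalg.svd(M2)
W = U[:, :k] / np.sqrt(g[:k])              # W = U Diag(γ^{-1/2}), as in step 3

Tw = np.einsum('abc,ai,bj,ck->ijk', T, W, W, W)   # whitened k x k x k tensor

# whitened components sqrt(lam_j) W^T a_j are orthonormal
V = (W.T @ A) * np.sqrt(lam)
assert np.allclose(V @ V.T, np.eye(k), atol=1e-8)
```

The whitened tensor is orthogonally decomposable, $T(W,W,W) = \sum_j \lambda_j^{-1/2} v_j^{\otimes 3}$ with $v_j = \sqrt{\lambda_j}\, W^\top a_j$, so the orthogonal-tensor convergence guarantees cited above apply.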
Procedure 6 Un-whitening
input Orthogonal rank-1 components $v_j \in \mathbb{R}^k$, $j \in [k]$.
1: Consider the matrix $M_2$ which was exploited for whitening in Procedure 5, and let $\tilde{\lambda}_j$, $j \in [k]$, denote the corresponding coefficients as $M_2 = A_1 \operatorname{Diag}(\tilde{\lambda}) A_1^\top$; see (39).
2: Compute the rank-$k$ SVD, $M_2 = U \operatorname{Diag}(\gamma) U^\top$, where $U \in \mathbb{R}^{d \times k}$ and $\gamma \in \mathbb{R}^k$.
3: Compute $(A_1)_j = \frac{1}{\sqrt{\tilde{\lambda}_j}} U \operatorname{Diag}(\gamma^{1/2}) v_j$, $j \in [k]$.
4: return $\{(A_1)_j\}_{j \in [k]}$.

This leads to $\tilde{T}$ having the decomposition
$$\tilde{T} = \sum_{j \in [k]} \lambda_j (a_j \odot a_j)^{\otimes 3}.$$
We then apply the tensor decomposition algorithm to this new tensor $\tilde{T}$. This now clarifies why the full column rank condition is imposed on the columns of $A \odot A = [a_1 \odot a_1 \cdots a_k \odot a_k]$. Similarly, we can perform higher order tensorizations, leading to more overcomplete models, by exploiting an initial higher order tensor $T$; see also Remark 3.

Efficient implementation of tensor decomposition given samples: The main update step in the tensor decomposition algorithm is the tensor power iteration, for which a multilinear operation is performed on the tensor $T$. However, the tensor is not available beforehand, and needs to be estimated from the samples (as in Algorithm 1 in the main text). Computing and storing the tensor can be enormously expensive for high-dimensional problems. But it is essential to note that, since we can form a factor form of the tensor $T$ using the samples and the other parameters of the model, we can manipulate the samples directly to perform the power update as multilinear operations, without explicitly forming the tensor. This leads to efficient computational complexity. See Anandkumar et al. (2014a) for details on these implicit update forms.

Algorithm 7 Robust tensor power method (Anandkumar et al., 2014b)
input symmetric tensor $\tilde{T} \in \mathbb{R}^{d' \times d' \times d'}$, number of iterations $N$, number of initializations $R$.
output the estimated eigenvector/eigenvalue pair; the deflated tensor.
1: for $\tau = 1$ to $R$ do
2:   Initialize $\hat{v}^{(\tau)}_0$ with the SVD-based method in Procedure 8.
3:   for $t = 1$ to $N$ do
4:     Compute the power iteration update
$$\hat{v}^{(\tau)}_t := \frac{\tilde{T}(I, \hat{v}^{(\tau)}_{t-1}, \hat{v}^{(\tau)}_{t-1})}{\|\tilde{T}(I, \hat{v}^{(\tau)}_{t-1}, \hat{v}^{(\tau)}_{t-1})\|} \qquad (31)$$
5:   end for
6: end for
7: Let $\tau^* := \arg\max_{\tau \in [R]} \{\tilde{T}(\hat{v}^{(\tau)}_N, \hat{v}^{(\tau)}_N, \hat{v}^{(\tau)}_N)\}$.
8: Do $N$ power iteration updates (31) starting from $\hat{v}^{(\tau^*)}_N$ to obtain $\hat{v}$, and set $\hat{\mu} := \tilde{T}(\hat{v}, \hat{v}, \hat{v})$.
9: return the estimated eigenvector/eigenvalue pair $(\hat{v}, \hat{\mu})$; the deflated tensor $\tilde{T} - \hat{\mu} \cdot \hat{v}^{\otimes 3}$.

Procedure 8 SVD-based initialization (Anandkumar et al., 2014c)
input Tensor $T \in \mathbb{R}^{d' \times d' \times d'}$.
1: for $\tau = 1$ to $\log(1/\hat{\delta})$ do
2:   Draw a random standard Gaussian vector $\theta^{(\tau)} \sim \mathcal{N}(0, I_{d'})$.
3:   Compute $u^{(\tau)}_1$ as the top left singular vector of $T(I, I, \theta^{(\tau)}) \in \mathbb{R}^{d' \times d'}$.
4: end for
5: $\hat{v}_0 \leftarrow \max_{\tau \in [\log(1/\hat{\delta})]} \bigl(u^{(\tau)}_1\bigr)_{\min}$.
6: return $\hat{v}_0$.

C  Proof of Theorem 3

The proof of Theorem 3 includes three main pieces, arguing the recovery guarantees of the three different parts of the algorithm: tensor decomposition, the Fourier method, and linear regression. In the first piece, we show that the tensor decomposition algorithm for estimating the weight matrix $A_1$ (see Algorithm 1 for the details) recovers it with the desired error. In the second piece, we analyze the performance of the Fourier technique for estimating the bias vector $b_1$ (see Algorithm 1 and Procedure 2 for the details), proving that the recovery error is small. Finally, as the last step, the ridge regression is analyzed to ensure that the parameters of the last layer of the neural network are well estimated, leading to the estimation of the overall function $\tilde{f}(x)$. We now provide the analysis of these three parts.
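As an illustrative sketch (not part of the original pseudo-code), the pipeline of Algorithm 7 and Procedure 8 can be exercised on a small orthogonally decomposable tensor in numpy. All dimensions and values below are hypothetical, and the restart-selection rule is simplified to keeping the candidate with the largest tensor value, rather than the max-min rule of Procedure 8:

```python
import numpy as np

def tensor_power_decomposition(T, k, n_iter=50, n_init=10, seed=0):
    """Power iterations with restarts and deflation for an orthogonally
    decomposable symmetric tensor T = sum_j mu_j v_j^{x3}."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    vals, comps = [], []
    for _ in range(k):
        best_v, best_mu = None, -np.inf
        for _ in range(n_init):
            # SVD-based init: top left singular vector of T(I, I, theta)
            theta = rng.standard_normal(T.shape[0])
            v = np.linalg.svd(np.einsum('ijl,l->ij', T, theta))[0][:, 0]
            for _ in range(n_iter):
                w = np.einsum('ijl,j,l->i', T, v, v)   # power update T(I, v, v)
                v = w / np.linalg.norm(w)
            mu = np.einsum('ijl,i,j,l->', T, v, v, v)  # eigenvalue T(v, v, v)
            if mu > best_mu:
                best_v, best_mu = v, mu
        vals.append(best_mu)
        comps.append(best_v)
        # deflation: subtract the recovered rank-1 term
        T = T - best_mu * np.einsum('i,j,l->ijl', best_v, best_v, best_v)
    return np.array(vals), np.column_stack(comps)

# Orthogonal test tensor with known components and eigenvalues
rng = np.random.default_rng(2)
d, k = 5, 3
Q = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :k]   # orthonormal columns
mu = np.array([3.0, 2.0, 1.0])
T = np.einsum('j,aj,bj,cj->abc', mu, Q, Q, Q)

vals, V = tensor_power_decomposition(T, k)
T_rec = np.einsum('j,aj,bj,cj->abc', vals, V, V, V)
assert np.allclose(np.sort(vals)[::-1], mu, atol=1e-6)
assert np.allclose(T_rec, T, atol=1e-6)
```

Note that the recovered components come with the permutation (and, for the third order tensor, sign) ambiguities discussed in the analysis below; the reconstruction check sidesteps them.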
C.1  Tensor decomposition guarantees

We first provide a short proof of Lemma 6, which shows how the rank-1 components of the third order tensor $\mathbb{E}[\tilde{y} \cdot \mathcal{S}_3(x)]$ are the columns of the weight matrix $A_1$.

Proof of Lemma 6: It is shown by Janzamin et al. (2014) that the score function yields a differential operator such that, for the label function $f(x) := \mathbb{E}[y|x]$, we have
$$\mathbb{E}[y \cdot \mathcal{S}_3(x)] = \mathbb{E}[\nabla^{(3)}_x f(x)].$$
Applying this property to the form of the label function $f(x)$ in (7), denoted by $\tilde{f}(x)$, we have
$$\mathbb{E}[\tilde{y} \cdot \mathcal{S}_3(x)] = \mathbb{E}[\sigma'''(\cdot)(a_2, A_1^\top, A_1^\top, A_1^\top)],$$
where $\sigma'''(\cdot)$ denotes the third order derivative of the element-wise function $\sigma(z) : \mathbb{R}^k \to \mathbb{R}^k$. More concretely, with a slight abuse of notation, $\sigma'''(z) \in \mathbb{R}^{k \times k \times k \times k}$ is a diagonal 4th order tensor with its $j$-th diagonal entry equal to $\frac{\partial^3 \sigma(z_j)}{\partial z_j^3} : \mathbb{R} \to \mathbb{R}$. Here, two properties are used to compute the third order derivative $\nabla^{(3)}_x \tilde{f}(x)$ on the R.H.S. of the above equation. 1) We apply the chain rule to take the derivatives, which generates a new factor of $A_1$ for each derivative; since we take the 3rd order derivative, we have 3 factors of $A_1$. 2) The linearity of the subsequent layers causes their derivatives to vanish, and thus we only have the above term as the derivative. Expanding the above multilinear form finishes the proof; see (28) for the definition of the multilinear form. □

We now provide the recovery guarantees of the weight matrix $A_1$ through tensor decomposition as follows.

Lemma 9. Among the conditions for Theorem 3, consider the rank constraint on $A_1$, and the non-vanishing assumption on the coefficients $\lambda_j$'s. Let the whitening be performed using the empirical version of the second order score function as specified in (39), and assume the coefficients $\tilde{\lambda}_j$'s do not vanish.
Suppose the sample complexity
$$n \geq \max\Biggl\{ \tilde{O}\Biggl( \tilde{y}_{\max}^2 \, \mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr] \cdot \frac{\tilde{\lambda}_{\max}^4}{\tilde{\lambda}_{\min}^4} \cdot \frac{s_{\max}^2(A_1)}{\lambda_{\min}^2 \cdot s_{\min}^6(A_1)} \cdot \frac{1}{\tilde{\epsilon}_1^2} \Biggr), \qquad (32)$$
$$\tilde{O}\Biggl( \tilde{y}_{\max}^2 \cdot \mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr] \cdot \Bigl(\frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}}\Bigr)^3 \frac{1}{\lambda_{\min}^2 \cdot s_{\min}^6(A_1)} \cdot k \Biggr),$$
$$\tilde{O}\Biggl( \tilde{y}_{\max}^2 \cdot \frac{\mathbb{E}\bigl[\|\mathcal{S}_2(x) \mathcal{S}_2^\top(x)\|\bigr]^{3/2}}{\mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr]^{1/2}} \cdot \frac{1}{\tilde{\lambda}_{\min}^2 \cdot s_{\min}^3(A_1)} \Biggr) \Biggr\}$$
holds, where $M_3(x) \in \mathbb{R}^{d \times d^2}$ denotes the matricization of the score function tensor $\mathcal{S}_3(x) \in \mathbb{R}^{d \times d \times d}$; see (1) for the definition of matricization. Then the estimate $\hat{A}_1$ by NN-LIFT Algorithm 1 satisfies, w.h.p.,
$$\min_{z \in \{\pm 1\}} \|(A_1)_j - z \cdot (\hat{A}_1)_j\| \leq \tilde{O}(\tilde{\epsilon}_1), \quad j \in [k],$$
where the recovery guarantee is up to permutation of the columns of $A_1$.

Remark 6 (Sign ambiguity). We observe that, in addition to the permutation ambiguity in the recovery guarantees, there is also a sign ambiguity issue in recovering the columns of matrix $A_1$ through the decomposition of the third order tensor in (19). This is because the signs of $(A_1)_j$ and the coefficient $\lambda_j$ can both change while the overall tensor stays fixed. Note that the coefficient $\lambda_j$ can be positive or negative. In the Fourier method for estimating $b_1$, miscalculating the sign of $(A_1)_j$ also leads to the sign of $b_1(j)$ being recovered in the opposite manner. In other words, the recovered sign of the bias $b_1(j)$ is consistent with the recovered sign of $(A_1)_j$. Recall that we assume the nonlinear activating function $\sigma(z)$ satisfies the property $\sigma(z) = 1 - \sigma(-z)$. Many popular activating functions, such as the step function, the sigmoid function, and the tanh function, satisfy this property.
Given this property, the sign ambiguity in the parameters $A_1$ and $b_1$, which leads to an opposite sign in the input $z$ to the activating function $\sigma(\cdot)$, can now be compensated by the sign of $a_2$ and the value of $b_2$, which are recovered through least squares.

Proof of Lemma 9: From Lemma 6, we know that the exact cross-moment $\tilde{T} = \mathbb{E}[\tilde{y} \cdot \mathcal{S}_3(x)]$ has rank-one components equal to the columns of matrix $A_1$; see Equation (19) for the tensor decomposition form. We apply a tensor decomposition method in NN-LIFT to estimate the columns of $A_1$. We employ the noisy tensor decomposition guarantees in Anandkumar et al. (2014c). They show that, when the perturbation tensor is small, the tensor power iteration initialized by the SVD-based Procedure 8 recovers the rank-1 components up to some small error. We also analyze the whitening step and combine it with this result, leading to Lemma 10.

Let us now characterize the perturbation matrix and tensor. By Lemma 6, the CP decomposition form is given by $\tilde{T} = \mathbb{E}[\tilde{y} \cdot \mathcal{S}_3(x)]$, and thus the perturbation tensor is written as
$$E := \tilde{T} - \hat{T} = \mathbb{E}[\tilde{y} \cdot \mathcal{S}_3(x)] - \frac{1}{n} \sum_{i \in [n]} \tilde{y}_i \cdot \mathcal{S}_3(x_i), \qquad (33)$$
where $\hat{T} = \frac{1}{n} \sum_{i \in [n]} \tilde{y}_i \cdot \mathcal{S}_3(x_i)$ is the empirical form used in NN-LIFT Algorithm 1. Notice that in the realizable setting, the neural network output $\tilde{y}$ is observed and is thus used in forming the empirical tensor. Similarly, the perturbation of the second order moment $\tilde{M}_2 = \mathbb{E}[\tilde{y} \cdot \mathcal{S}_2(x)]$ is given by
$$E_2 := \tilde{M}_2 - \hat{M}_2 = \mathbb{E}[\tilde{y} \cdot \mathcal{S}_2(x)] - \frac{1}{n} \sum_{i \in [n]} \tilde{y}_i \cdot \mathcal{S}_2(x_i). \qquad (34)$$
In order to bound $\|E\|$, we matricize it to apply the matrix Bernstein inequality.
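Matricizing is convenient because the tensor spectral norm is dominated by the spectral norm of the matricization: for unit vectors, $E(u, v, w) = u^\top \tilde{E}\,(v \otimes w)$, hence $\|E\| \leq \|\tilde{E}\|$. A small numerical sanity check of this inequality on a hypothetical random tensor (numpy; not part of the original proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
E = rng.standard_normal((d, d, d))   # stand-in perturbation tensor
E_mat = E.reshape(d, d * d)          # mode-1 matricization, R^{d x d^2}
spec = np.linalg.norm(E_mat, 2)      # spectral norm of the matricization

# |E(u, v, w)| = |u^T E_mat (v kron w)| <= ||E_mat||_2 for unit u, v, w,
# since ||v kron w|| = 1
for _ in range(200):
    u, v, w = (x / np.linalg.norm(x) for x in rng.standard_normal((3, d)))
    val = abs(np.einsum('ijl,i,j,l->', E, u, v, w))
    assert val <= spec + 1e-12
```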
We have the matricized version
$$\tilde{E} := \mathbb{E}[\tilde{y} \cdot M_3(x)] - \frac{1}{n} \sum_{i \in [n]} \tilde{y}_i \cdot M_3(x_i) = \sum_{i \in [n]} \frac{1}{n} \bigl( \mathbb{E}[\tilde{y} \cdot M_3(x)] - \tilde{y}_i \cdot M_3(x_i) \bigr),$$
where $M_3(x) \in \mathbb{R}^{d \times d^2}$ is the matricization of $\mathcal{S}_3(x) \in \mathbb{R}^{d \times d \times d}$; see (1) for the definition of matricization. Now the norm of $\tilde{E}$ can be bounded by the matrix Bernstein inequality. The norm of each (centered) random variable inside the summation is bounded by $\frac{\tilde{y}_{\max}}{n} \mathbb{E}[\|M_3(x)\|]$, where $\tilde{y}_{\max}$ is the bound on $|\tilde{y}|$. The variance term is also bounded as
$$\frac{1}{n^2} \Bigl\| \sum_{i \in [n]} \mathbb{E}\bigl[\tilde{y}_i^2 \cdot M_3(x_i) M_3^\top(x_i)\bigr] \Bigr\| \leq \frac{1}{n} \tilde{y}_{\max}^2 \, \mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr].$$
Applying the matrix Bernstein inequality, we have w.h.p.
$$\|E\| \leq \|\tilde{E}\| \leq \tilde{O}\Bigl( \frac{\tilde{y}_{\max}}{\sqrt{n}} \sqrt{\mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr]} \Bigr). \qquad (35)$$
The second order perturbation $E_2$ is already a matrix, and by applying the matrix Bernstein inequality we similarly argue that w.h.p.
$$\|E_2\| \leq \tilde{O}\Bigl( \frac{\tilde{y}_{\max}}{\sqrt{n}} \sqrt{\mathbb{E}\bigl[\|\mathcal{S}_2(x) \mathcal{S}_2^\top(x)\|\bigr]} \Bigr). \qquad (36)$$
There is one more remaining piece to complete the proof of the tensor decomposition part. The analysis in Anandkumar et al. (2014c) does not involve any whitening step, and thus we need to adapt the perturbation analysis of Anandkumar et al. (2014c) to our additional whitening procedure. This is done in Lemma 10. In the final recovery bound (46) in Lemma 10, there are two terms: one involving $\|E\|$, and the other involving $\|E_2\|$. We first impose a bound on the sample complexity such that the bound involving $\|E\|$ dominates the bound involving $\|E_2\|$, as follows. Considering the bounds on $\|E\|$ and $\|E_2\|$ in (35) and (36), imposing the lower bound on the number of samples (the third bound stated in the lemma)
$$n \geq \tilde{O}\Biggl( \tilde{y}_{\max}^2 \cdot \frac{\mathbb{E}\bigl[\|\mathcal{S}_2(x) \mathcal{S}_2^\top(x)\|\bigr]^{3/2}}{\mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr]^{1/2}} \cdot \frac{1}{\tilde{\lambda}_{\min}^2 \cdot s_{\min}^3(A_1)} \Biggr)$$
leads to this goal. Having done this, we no longer need to impose the bound on $\|E_2\|$, and applying the perturbation bound (35) to the required bound on $\|E\|$ in Lemma 10 leads to the sample complexity bound (the second bound stated in the lemma)
$$n \geq \tilde{O}\Biggl( \tilde{y}_{\max}^2 \cdot \mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr] \cdot \Bigl(\frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}}\Bigr)^3 \frac{1}{\lambda_{\min}^2 \cdot s_{\min}^6(A_1)} \cdot k \Biggr).$$
Finally, applying the result of Lemma 10, we have the column-wise error guarantees (up to permutation)
$$\|(A_1)_j - (\hat{A}_1)_j\| \leq \tilde{O}\Biggl( \frac{s_{\max}(A_1)}{\lambda_{\min}} \frac{\tilde{\lambda}_{\max}^2}{\sqrt{\tilde{\lambda}_{\min}}} \cdot \frac{\tilde{y}_{\max}}{\tilde{\lambda}_{\min}^{1.5} \cdot s_{\min}^3(A_1)} \cdot \frac{\sqrt{\mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr]}}{\sqrt{n}} \Biggr) \leq \tilde{O}(\tilde{\epsilon}_1),$$
where in the first inequality we substituted the bound on $\|E\|$ in (35), and the first bound on $n$ stated in the lemma is used in the last inequality. □

C.1.1  Whitening analysis

The perturbation analysis of the proposed tensor decomposition method in Algorithm 7, with the corresponding SVD-based initialization in Procedure 8, is provided in Anandkumar et al. (2014c). But they do not consider the effect of the whitening proposed in Procedures 5 and 6. Thus, we need to adapt the perturbation analysis of Anandkumar et al. (2014c) to the case where the whitening procedure is incorporated. We perform this in this section. We first elaborate on the whitening step and analyze how the proposed Procedure 5 works. We then analyze the inversion of the whitening operator, showing how the components in the whitened space are translated back to the original space, as stated in Procedure 6. We finally provide the perturbation analysis of the whitening step when estimates of the moments are given.

Whitening procedure: Consider the second order moment $\tilde{M}_2$, which is used in Procedure 5 to whiten the third order tensor
$$\tilde{T} = \sum_{j \in [k]} \lambda_j \cdot (A_1)_j \otimes (A_1)_j \otimes (A_1)_j. \qquad (37)$$
It is constructed such that it has the same decomposition form as the target tensor $\tilde{T}$, i.e., we have
$$\tilde{M}_2 = \sum_{j \in [k]} \tilde{\lambda}_j \cdot (A_1)_j \otimes (A_1)_j. \qquad (38)$$
We propose two options for constructing $\tilde{M}_2$ in Procedure 5. The first option is to use the second order score function and construct $\tilde{M}_2 := \mathbb{E}[\tilde{y} \cdot \mathcal{S}_2(x)]$, for which we have
$$\tilde{M}_2 := \mathbb{E}[\tilde{y} \cdot \mathcal{S}_2(x)] = \sum_{j \in [k]} \tilde{\lambda}_j \cdot (A_1)_j \otimes (A_1)_j, \qquad (39)$$
where
$$\tilde{\lambda}_j = \mathbb{E}\bigl[\sigma''(z_j)\bigr] \cdot a_2(j), \qquad (40)$$
for the vector $z := A_1^\top x + b_1$ as the input to the nonlinear operator $\sigma(\cdot)$. This is proved similarly to Lemma 6. The second option leads to the same form for $\tilde{M}_2$ as in (38), with the coefficients modified as $\tilde{\lambda}_j = \lambda_j \cdot \langle (A_1)_j, \theta \rangle$.

Let the matrix $W \in \mathbb{R}^{d \times k}$ denote the whitening matrix in the noiseless case; i.e., the whitening matrix $W$ in Procedure 5 is constructed such that $W^\top \tilde{M}_2 W = I_k$. Applying the whitening matrix $W$ to the noiseless tensor $\tilde{T} = \sum_{j \in [k]} \lambda_j \cdot (A_1)_j \otimes (A_1)_j \otimes (A_1)_j$, we have
$$\tilde{T}(W, W, W) = \sum_{j \in [k]} \lambda_j \bigl(W^\top (A_1)_j\bigr)^{\otimes 3} = \sum_{j \in [k]} \frac{\lambda_j}{\tilde{\lambda}_j^{3/2}} \Bigl(\sqrt{\tilde{\lambda}_j}\, W^\top (A_1)_j\Bigr)^{\otimes 3} = \sum_{j \in [k]} \mu_j v_j^{\otimes 3}, \qquad (41)$$
where in the last equality we define
$$\mu_j := \frac{\lambda_j}{\tilde{\lambda}_j^{3/2}}, \qquad v_j := \sqrt{\tilde{\lambda}_j}\, W^\top (A_1)_j, \quad j \in [k]. \qquad (42)$$
Let $V := [v_1\ v_2\ \cdots\ v_k] \in \mathbb{R}^{k \times k}$ denote the factor matrix for $\tilde{T}(W, W, W)$. We have
$$V := W^\top A_1 \operatorname{Diag}(\tilde{\lambda}^{1/2}), \qquad (43)$$
and thus
$$V V^\top = W^\top A_1 \operatorname{Diag}(\tilde{\lambda}) A_1^\top W = W^\top \tilde{M}_2 W = I_k.$$
Since $V$ is a square matrix, it follows that $V^\top V = I_k$ as well, and therefore the tensor $\tilde{T}(W, W, W)$ is whitened in the sense that its rank-1 components $v_j$'s form an orthonormal basis. This discussion clarifies how the whitening procedure works.

Inversion of the whitening procedure: Let us also analyze the inversion procedure, i.e., how to transform the $v_j$'s back to the $(A_1)_j$'s. The main step is stated in Procedure 6.
Following whitening Procedure 5, let $\tilde{M}_2 = U \operatorname{Diag}(\gamma) U^\top$, with $U \in \mathbb{R}^{d \times k}$ and $\gamma \in \mathbb{R}^k$, denote the rank-$k$ SVD of $\tilde{M}_2$. Substituting the whitening matrix $W := U \operatorname{Diag}(\gamma^{-1/2})$ in (43) and multiplying by $U \operatorname{Diag}(\gamma^{1/2})$ from the left, we have
$$U \operatorname{Diag}(\gamma^{1/2}) V = U U^\top A_1 \operatorname{Diag}(\tilde{\lambda}^{1/2}).$$
Since the column spans of $A_1 \in \mathbb{R}^{d \times k}$ and $U \in \mathbb{R}^{d \times k}$ are the same (given their relations to $\tilde{M}_2$), $A_1$ is a fixed point of the projection operator onto the subspace spanned by the columns of $U$. This projection operator is $U U^\top$ (since the columns of $U$ form an orthonormal basis), and therefore $U U^\top A_1 = A_1$. Applying this to the above equation, we have $A_1 = U \operatorname{Diag}(\gamma^{1/2}) V \operatorname{Diag}(\tilde{\lambda}^{-1/2})$, i.e.,
$$(A_1)_j = \frac{1}{\sqrt{\tilde{\lambda}_j}} U \operatorname{Diag}(\gamma^{1/2}) v_j, \quad j \in [k]. \qquad (44)$$
The above discussion describes the details of the whitening and unwhitening procedures. We now provide the guarantees of tensor decomposition given noisy versions of the moments $\tilde{M}_2$ and $\tilde{T}$.

Lemma 10. Let $\hat{M}_2 = \tilde{M}_2 - E_2$ and $\hat{T} = \tilde{T} - E$ respectively denote the noisy versions of
$$\tilde{M}_2 = \sum_{j \in [k]} \tilde{\lambda}_j \cdot (A_1)_j \otimes (A_1)_j, \qquad \tilde{T} = \sum_{j \in [k]} \lambda_j \cdot (A_1)_j \otimes (A_1)_j \otimes (A_1)_j. \qquad (45)$$
Assume the second and third order perturbations satisfy the bounds
$$\|E_2\| \leq \tilde{O}\Biggl( \frac{\lambda_{\min}^{1/3} \tilde{\lambda}_{\min}^{7/6}}{\sqrt{\tilde{\lambda}_{\max}}} \, s_{\min}^2(A_1) \, \frac{1}{k^{1/6}} \Biggr), \qquad \|E\| \leq \tilde{O}\Biggl( \lambda_{\min} \Bigl(\frac{\tilde{\lambda}_{\min}}{\tilde{\lambda}_{\max}}\Bigr)^{1.5} s_{\min}^3(A_1) \, \frac{1}{\sqrt{k}} \Biggr).$$
Then, the proposed tensor decomposition algorithm recovers estimates of the rank-1 components $(A_1)_j$'s satisfying the error bound
$$\|(A_1)_j - (\hat{A}_1)_j\| \leq \tilde{O}\Biggl( \frac{s_{\max}(A_1)}{\lambda_{\min}} \cdot \frac{\tilde{\lambda}_{\max}^2}{\sqrt{\tilde{\lambda}_{\min}}} \cdot \biggl[ \frac{\|E_2\|^3}{\tilde{\lambda}_{\min}^{3.5} \cdot s_{\min}^6(A_1)} + \frac{\|E\|}{\tilde{\lambda}_{\min}^{1.5} \cdot s_{\min}^3(A_1)} \biggr] \Biggr), \quad j \in [k]. \qquad (46)$$
Proof: We do not have access to the true matrix $\tilde{M}_2$ and the true tensor $\tilde{T}$, and the perturbed versions $\hat{M}_2 = \tilde{M}_2 - E_2$ and $\hat{T} = \tilde{T} - E$ are used in the whitening procedure.
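Procedures 5 and 6 and the identity (44) can be checked numerically. A minimal numpy sketch (hypothetical dimensions and coefficients) constructs $\tilde{M}_2 = A_1 \operatorname{Diag}(\tilde{\lambda}) A_1^\top$, whitens, verifies that the $v_j$'s are orthonormal, and un-whitens to recover the columns of $A_1$:

```python
import numpy as np

np.random.seed(1)
d, k = 6, 3
A = np.random.randn(d, k)                # full column rank w.h.p.
lt = np.random.rand(k) + 0.5             # positive coefficients lambda~_j
M2 = (A * lt) @ A.T                      # M2 = sum_j lt_j a_j a_j^T

# Procedure 5: whitening matrix from the rank-k SVD of M2
U, g, _ = np.linalg.svd(M2)
U, g = U[:, :k], g[:k]
W = U / np.sqrt(g)                       # W = U Diag(g^{-1/2})

# v_j = sqrt(lt_j) W^T a_j form an orthonormal basis (Eq. (42)-(43))
V = (W.T @ A) * np.sqrt(lt)
assert np.allclose(V.T @ V, np.eye(k), atol=1e-8)

# Procedure 6 / Eq. (44): un-whitening recovers the columns of A exactly
A_rec = (U * np.sqrt(g)) @ V / np.sqrt(lt)
assert np.allclose(A_rec, A, atol=1e-6)
```

This also makes the fixed-point argument concrete: since $U U^\top A_1 = A_1$, no information about $A_1$ is lost by the rank-$k$ projection, so un-whitening is exact in the noiseless case.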
Here, $E_2 \in \mathbb{R}^{d \times d}$ denotes the perturbation matrix, and $E \in \mathbb{R}^{d \times d \times d}$ denotes the perturbation tensor. Similar to the noiseless case, let $\hat{W} \in \mathbb{R}^{d \times k}$ denote the whitening matrix constructed by Procedure 5 such that $\hat{W}^\top \hat{M}_2 \hat{W} = I_k$, so that it orthogonalizes the noisy matrix $\hat{M}_2$. Applying the whitening matrix $\hat{W}$ to the tensor $\hat{T}$, we have
$$\hat{T}(\hat{W}, \hat{W}, \hat{W}) = \tilde{T}(W, W, W) - \tilde{T}(W - \hat{W}, W - \hat{W}, W - \hat{W}) - E(\hat{W}, \hat{W}, \hat{W}) = \sum_{j \in [k]} \mu_j v_j^{\otimes 3} - E_W, \qquad (47)$$
where we used Equation (41), and we defined
$$E_W := \tilde{T}(W - \hat{W}, W - \hat{W}, W - \hat{W}) + E(\hat{W}, \hat{W}, \hat{W}) \qquad (48)$$
as the perturbation tensor after whitening. Note that the perturbation comes from two sources: one is the error in computing the whitening matrix, reflected in $W - \hat{W}$, and the other is the error in the tensor $\hat{T}$, reflected in $E$.

We know that the rank-1 components $v_j$'s form an orthonormal basis, and thus we have a noisy orthogonal tensor decomposition problem in (47). We apply the result of Anandkumar et al. (2014c), where they show that if
$$\|E_W\| \leq \frac{\mu_{\min} \sqrt{\log k}}{\alpha_0 \sqrt{k}},$$
for some constant $\alpha_0 > 1$, then the tensor power iteration (applied to the whitened tensor) recovers the tensor rank-1 components with bounded error (up to permutation of the columns)
$$\|v_j - \hat{v}_j\| \leq \tilde{O}\Bigl( \frac{\|E_W\|}{\mu_{\min}} \Bigr). \qquad (49)$$
We now relate the norm of $E_W$ to the norms of the original perturbations $E$ and $E_2$. For the first term in (48), from Lemmata 4 and 5 of Song et al. (2013), we have
$$\|\tilde{T}(W - \hat{W}, W - \hat{W}, W - \hat{W})\| \leq \frac{64 \|E_2\|^3}{\tilde{\lambda}_{\min}^{3.5} \cdot s_{\min}^6(A_1)}.$$
For the second term, by the sub-multiplicative property we have
$$\|E(\hat{W}, \hat{W}, \hat{W})\| \leq \|E\| \cdot \|\hat{W}\|^3 \leq 8 \|E\| \cdot \|W\|^3 \leq \frac{8 \|E\|}{s_{\min}^3(A_1) \, \tilde{\lambda}_{\min}^{3/2}}.$$
Here, in the last inequality, we used
$$\|W\| = \frac{1}{\sqrt{s_k(\tilde{M}_2)}} \leq \frac{1}{s_{\min}(A_1) \sqrt{\tilde{\lambda}_{\min}}},$$
where $s_k(\tilde{M}_2)$ denotes the $k$-th largest singular value of $\tilde{M}_2$.
Here, the equality follows from the definition of $W$ based on the rank-$k$ SVD of $\tilde{M}_2$ in Procedure 5, and the inequality follows from $\tilde{M}_2 = A_1 \operatorname{Diag}(\tilde{\lambda}) A_1^\top$. Substituting these bounds, we finally need the condition
$$\frac{64 \|E_2\|^3}{\tilde{\lambda}_{\min}^{3.5} \cdot s_{\min}^6(A_1)} + \frac{8 \|E\|}{\tilde{\lambda}_{\min}^{1.5} \cdot s_{\min}^3(A_1)} \leq \frac{\lambda_{\min} \sqrt{\log k}}{\alpha_0 \, \tilde{\lambda}_{\max}^{1.5} \sqrt{k}},$$
where we also substituted the bound $\mu_{\min} \geq \lambda_{\min} / \tilde{\lambda}_{\max}^{1.5}$, given Equation (42). The bounds stated in the lemma ensure that each of the terms on the left hand side of the inequality is bounded by the right hand side. Thus, by the result of Anandkumar et al. (2014c), we have $\|v_j - \hat{v}_j\| \leq \tilde{O}(\|E_W\| / \mu_{\min})$. On the other hand, by the unwhitening relationship in (44), we have
$$\|(A_1)_j - (\hat{A}_1)_j\| = \frac{1}{\sqrt{\tilde{\lambda}_j}} \bigl\|\operatorname{Diag}(\gamma^{1/2}) \cdot [v_j - \hat{v}_j]\bigr\| \leq \sqrt{\frac{\gamma_{\max}}{\tilde{\lambda}_{\min}}} \cdot \|v_j - \hat{v}_j\| \leq s_{\max}(A_1) \cdot \sqrt{\frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}}} \cdot \|v_j - \hat{v}_j\|, \qquad (50)$$
where in the equality we use the fact that the orthonormal matrix $U$ preserves the $\ell_2$ norm, and the sub-multiplicative property is exploited in the first inequality. The last inequality follows from $\gamma_{\max} = s_{\max}(\tilde{M}_2) \leq s_{\max}^2(A_1) \cdot \tilde{\lambda}_{\max}$, which is from $\tilde{M}_2 = A_1 \operatorname{Diag}(\tilde{\lambda}) A_1^\top$. Incorporating the error bound on $\|v_j - \hat{v}_j\|$ in (49), we have
$$\|(A_1)_j - (\hat{A}_1)_j\| \leq \tilde{O}\Biggl( s_{\max}(A_1) \cdot \sqrt{\frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}}} \cdot \frac{\|E_W\|}{\mu_{\min}} \Biggr) \leq \tilde{O}\Biggl( \frac{s_{\max}(A_1)}{\lambda_{\min}} \cdot \frac{\tilde{\lambda}_{\max}^2}{\sqrt{\tilde{\lambda}_{\min}}} \cdot \|E_W\| \Biggr),$$
where we used the bound $\mu_{\min} \geq \lambda_{\min} / \tilde{\lambda}_{\max}^{1.5}$ in the last step. □

C.2  Fourier analysis guarantees

The analysis of the Fourier method for estimating the parameter $b_1$ consists of the following two lemmas. In the first lemma, we characterize the mean of the random variable $v$ introduced in Algorithm 1 in the realizable setting. This clarifies why the phase of $v$ is related to the unknown parameter $b_1$. In the second lemma, we argue the concentration of $v$ around its mean, leading to the sample complexity result.
Note that $v$ is denoted by $\tilde{v}$ in the realizable setting. The Fourier method can also be used to estimate the weight vector $a_2$, since it appears in the magnitude of the complex number $v$. In this section, we also provide the analysis of estimating $a_2$ with the Fourier method, which can be used as an alternative, although we primarily estimate $a_2$ by the ridge regression analyzed in Appendix C.3.

Lemma 7 (Restated). Let
$$\tilde{v} := \frac{1}{n} \sum_{i \in [n]} \frac{\tilde{y}_i}{p(x_i)} e^{-j \langle \omega_i, x_i \rangle}. \qquad (51)$$
Notice this is a realizable version of $v$ in Procedure 2, where the output corresponds to a neural network $\tilde{y}$. If the $\omega_i$'s are uniformly i.i.d. drawn from the set $\Omega_l$, then $\tilde{v}$ has mean (which is computed over $x$, $\tilde{y}$, and $\omega$)
$$\mathbb{E}[\tilde{v}] = \frac{1}{|\Omega_l|} \Sigma\Bigl(\frac{1}{2}\Bigr) a_2(l) e^{j \pi b_1(l)}, \qquad (52)$$
where $|\Omega_l|$ denotes the surface area of the $(d-1)$-dimensional manifold $\Omega_l$, and $\Sigma(\cdot)$ denotes the Fourier transform of $\sigma(\cdot)$.

Proof: Let $\tilde{F}(\omega)$ denote the Fourier transform of the label function $\tilde{f}(x) := \mathbb{E}[\tilde{y}|x] = \langle a_2, \sigma(A_1^\top x + b_1) \rangle$, which is (Marks II and Arabshahi, 1994)
$$\tilde{F}(\omega) = \sum_{j \in [k]} \frac{a_2(j)}{|A_1(d, j)|} \Sigma\Bigl(\frac{\omega_d}{A_1(d, j)}\Bigr) e^{j 2\pi b_1(j) \frac{\omega_d}{A_1(d, j)}} \, \delta\Bigl(\omega_- - \frac{\omega_d}{A_1(d, j)} A_1(\backslash d, j)\Bigr), \qquad (53)$$
where $\Sigma(\cdot)$ is the Fourier transform of $\sigma(\cdot)$, $u_-^\top = [u_1, u_2, \ldots, u_{d-1}]$ is the vector $u^\top$ with the last entry removed, $A_1(\backslash d, j) \in \mathbb{R}^{d-1}$ is the $j$-th column of matrix $A_1$ with the $d$-th (last) entry removed, and finally $\delta(u) = \delta(u_1) \delta(u_2) \cdots \delta(u_d)$. Let $p(\omega)$ denote the probability density function of the frequency $\omega$.
We have
$$\mathbb{E}[\tilde{v}] = \mathbb{E}_{x, \tilde{y}, \omega}\Bigl[ \frac{\tilde{y}}{p(x)} e^{-j \langle \omega, x \rangle} \Bigr] = \mathbb{E}_{x, \omega}\Bigl[ \mathbb{E}_{\tilde{y}}\Bigl[ \frac{\tilde{y}}{p(x)} e^{-j \langle \omega, x \rangle} \Bigm| x, \omega \Bigr] \Bigr] = \mathbb{E}_{x, \omega}\Biggl[ \frac{\tilde{f}(x)}{p(x)} e^{-j \langle \omega, x \rangle} \Biggr] = \int_{\Omega_l} \int \tilde{f}(x) e^{-j \langle \omega, x \rangle} p(\omega) \, dx \, d\omega = \int_{\Omega_l} \tilde{F}(\omega) p(\omega) \, d\omega,$$
where the second equality uses the law of total expectation, the third equality exploits the label-generating function definition $\tilde{f}(x) := \mathbb{E}[\tilde{y}|x]$, and the final equality follows from the definition of the Fourier transform. The variable $\omega \in \mathbb{R}^d$ is drawn from a $(d-1)$-dimensional manifold $\Omega_l \subset \mathbb{R}^d$. In order to compute the above integral, we define the $d$-dimensional set
$$\Omega_{l; \nu} := \Bigl\{ \omega \in \mathbb{R}^d : \frac{1}{2} - \frac{\nu}{2} \leq \|\omega\| \leq \frac{1}{2} + \frac{\nu}{2}, \;\; \bigl| \langle \omega, (\hat{A}_1)_l \rangle \bigr| \geq \frac{1 - \tilde{\epsilon}_1^2/2}{2} \Bigr\},$$
for which $\Omega_l = \lim_{\nu \to 0^+} \Omega_{l; \nu}$. Assuming the $\omega$'s are uniformly drawn from $\Omega_{l; \nu}$, we have
$$\mathbb{E}[\tilde{v}] = \lim_{\nu \to 0^+} \int_{\Omega_{l; \nu}} \tilde{F}(\omega) p(\omega) \, d\omega = \lim_{\nu \to 0^+} \frac{1}{|\Omega_{l; \nu}|} \int_{-\infty}^{+\infty} \tilde{F}(\omega) \mathbf{1}_{\Omega_{l; \nu}}(\omega) \, d\omega.$$
The second equality follows from the uniform draws of $\omega$ from the set $\Omega_{l; \nu}$, so that $p(\omega) = \frac{1}{|\Omega_{l; \nu}|} \mathbf{1}_{\Omega_{l; \nu}}(\omega)$, where $\mathbf{1}_S(\cdot)$ denotes the indicator function of the set $S$. Here, $|\Omega_{l; \nu}|$ denotes the volume of the $d$-dimensional set $\Omega_{l; \nu}$, for which, in the limit $\nu \to 0^+$, we have $|\Omega_{l; \nu}| = \nu \cdot |\Omega_l|$, where $|\Omega_l|$ denotes the surface area of the $(d-1)$-dimensional manifold $\Omega_l$. For small enough $\tilde{\epsilon}_1$ in the definition of $\Omega_{l; \nu}$, only the delta function for $j = l$ in the expansion of $\tilde{F}(\omega)$ in (53) survives the above integral, and thus
$$\mathbb{E}[\tilde{v}] = \lim_{\nu \to 0^+} \frac{1}{|\Omega_{l; \nu}|} \int_{-\infty}^{+\infty} \frac{a_2(l)}{|A_1(d, l)|} \Sigma\Bigl(\frac{\omega_d}{A_1(d, l)}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_d}{A_1(d, l)}} \, \delta\Bigl(\omega_- - \frac{\omega_d}{A_1(d, l)} A_1(\backslash d, l)\Bigr) \mathbf{1}_{\Omega_{l; \nu}}(\omega) \, d\omega.$$
To simplify the notation, in the rest of the proof we denote the $l$-th column of matrix $A_1$ by the vector $\alpha$, i.e., $\alpha := (A_1)_l$.
Thus, the goal is to compute the integral
$$I := \int_{-\infty}^{+\infty} \frac{1}{|\alpha_d|} \Sigma\Bigl(\frac{\omega_d}{\alpha_d}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_d}{\alpha_d}} \, \delta\Bigl(\omega_- - \frac{\omega_d}{\alpha_d} \alpha_-\Bigr) \mathbf{1}_{\Omega_{l; \nu}}(\omega) \, d\omega,$$
and note that $\mathbb{E}[\tilde{v}] = a_2(l) \cdot \lim_{\nu \to 0^+} \frac{I}{|\Omega_{l; \nu}|}$. The rest of the proof is about computing the above integral. The integral involves delta functions, where the final value is expected to be computed at a single point, specified by the intersection of the line $\omega_- = \frac{\omega_d}{\alpha_d} \alpha_-$ and the sphere $\|\omega\| = \frac{1}{2}$ (when we consider the limit $\nu \to 0^+$). This is based on the following integration property of delta functions: for a function $g(\cdot) : \mathbb{R} \to \mathbb{R}$,
$$\int_{-\infty}^{+\infty} g(t) \delta(t) \, dt = g(0). \qquad (54)$$
We first expand the delta function as follows:
$$I = \int_{-\infty}^{+\infty} \frac{1}{|\alpha_d|} \Sigma\Bigl(\frac{\omega_d}{\alpha_d}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_d}{\alpha_d}} \, \delta\Bigl(\omega_1 - \frac{\alpha_1}{\alpha_d} \omega_d\Bigr) \cdots \delta\Bigl(\omega_{d-1} - \frac{\alpha_{d-1}}{\alpha_d} \omega_d\Bigr) \mathbf{1}_{\Omega_{l; \nu}}(\omega) \, d\omega$$
$$= \int \cdots \int_{-\infty}^{+\infty} \Sigma\Bigl(\frac{\omega_d}{\alpha_d}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_d}{\alpha_d}} \, \delta\Bigl(\omega_1 - \frac{\alpha_1}{\alpha_d} \omega_d\Bigr) \cdots \delta\Bigl(\omega_{d-2} - \frac{\alpha_{d-2}}{\alpha_d} \omega_d\Bigr) \mathbf{1}_{\Omega_{l; \nu}}(\omega) \cdot \delta(\alpha_d \omega_{d-1} - \alpha_{d-1} \omega_d) \, d\omega_1 \cdots d\omega_d,$$
where we used the property $\frac{1}{|\beta|} \delta(t) = \delta(\beta t)$ in the second equality. Introducing a new variable $z$ and applying the change of variables $\omega_d = \frac{1}{\alpha_{d-1}}(\alpha_d \omega_{d-1} - z)$, we have
$$I = \int \cdots \int_{-\infty}^{+\infty} \Sigma\Bigl(\frac{\omega_d}{\alpha_d}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_d}{\alpha_d}} \, \delta\Bigl(\omega_1 - \frac{\alpha_1}{\alpha_d} \omega_d\Bigr) \cdots \delta\Bigl(\omega_{d-2} - \frac{\alpha_{d-2}}{\alpha_d} \omega_d\Bigr) \mathbf{1}_{\Omega_{l; \nu}}(\omega) \cdot \delta(z) \, \frac{d\omega_1 \cdots d\omega_{d-1} \, dz}{\alpha_{d-1}}$$
$$= \int \cdots \int_{-\infty}^{+\infty} \frac{1}{\alpha_{d-1}} \Sigma\Bigl(\frac{\omega_{d-1}}{\alpha_{d-1}}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_{d-1}}{\alpha_{d-1}}} \, \delta\Bigl(\omega_1 - \frac{\alpha_1}{\alpha_{d-1}} \omega_{d-1}\Bigr) \cdots \delta\Bigl(\omega_{d-2} - \frac{\alpha_{d-2}}{\alpha_{d-1}} \omega_{d-1}\Bigr) \mathbf{1}_{\Omega_{l; \nu}}\Bigl(\omega_1, \omega_2, \ldots, \omega_{d-1}, \frac{\alpha_d}{\alpha_{d-1}} \omega_{d-1}\Bigr) \, d\omega_1 \cdots d\omega_{d-1}.$$
For the sake of simplifying the mathematical notation, we did not substitute all the $\omega_d$'s with $z$ in the first equality; note that all $\omega_d$'s are implicitly a function of $z$, which is finally accounted for in the second equality, where the delta integration property (54) is applied to the variable $z$ (note that $z = 0$ is the same as $\frac{\omega_d}{\alpha_d} = \frac{\omega_{d-1}}{\alpha_{d-1}}$). Repeating the above process several times, we finally have
$$I = \int_{-\infty}^{+\infty} \frac{1}{\alpha_1} \Sigma\Bigl(\frac{\omega_1}{\alpha_1}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_1}{\alpha_1}} \cdot \mathbf{1}_{\Omega_{l; \nu}}\Bigl(\omega_1, \frac{\alpha_2}{\alpha_1} \omega_1, \ldots, \frac{\alpha_{d-1}}{\alpha_1} \omega_1, \frac{\alpha_d}{\alpha_1} \omega_1\Bigr) \, d\omega_1.$$
There is a line constraint $\frac{\omega_1}{\alpha_1} = \frac{\omega_2}{\alpha_2} = \cdots = \frac{\omega_d}{\alpha_d}$ in the argument of the indicator function. This implies that
$$\|\omega\| = \frac{\|\alpha\|}{\alpha_1} \omega_1 = \frac{\omega_1}{\alpha_1},$$
where we used $\|\alpha\| = \|(A_1)_l\| = 1$. Incorporating this into the norm bound imposed by the definition of $\Omega_{l; \nu}$, we have $\frac{\alpha_1}{2}(1 - \nu) \leq \omega_1 \leq \frac{\alpha_1}{2}(1 + \nu)$, and hence
$$I = \int_{\frac{\alpha_1}{2}(1 - \nu)}^{\frac{\alpha_1}{2}(1 + \nu)} \frac{1}{\alpha_1} \Sigma\Bigl(\frac{\omega_1}{\alpha_1}\Bigr) e^{j 2\pi b_1(l) \frac{\omega_1}{\alpha_1}} \, d\omega_1.$$
We know $\mathbb{E}[\tilde{v}] = a_2(l) \cdot \lim_{\nu \to 0^+} \frac{I}{|\Omega_{l; \nu}|}$, and thus
$$\mathbb{E}[\tilde{v}] = a_2(l) \cdot \frac{1}{\nu \cdot |\Omega_l|} \cdot \alpha_1 \nu \, \frac{1}{\alpha_1} \Sigma\Bigl(\frac{1}{2}\Bigr) e^{j 2\pi b_1(l) \frac{1}{2}} = \frac{1}{|\Omega_l|} a_2(l) \Sigma\Bigl(\frac{1}{2}\Bigr) e^{j \pi b_1(l)},$$
where in the first step we use $|\Omega_{l; \nu}| = \nu \cdot |\Omega_l|$ and evaluate the integral $I$ in the limit $\nu \to 0^+$. This finishes the proof. □

In the following lemma, we argue the concentration of $v$ around its mean, which leads to the sample complexity bound for estimating the parameter $b_1$ (and also $a_2$) within the desired error.

Lemma 11. Suppose the sample complexity
$$n \geq O\Biggl( \frac{\tilde{\zeta}_{\tilde{f}}}{\psi \, \tilde{\epsilon}_2^2} \log \frac{k}{\delta} \Biggr)$$
(55)
holds for small enough $\tilde{\epsilon}_2 \leq \tilde{\zeta}_{\tilde{f}}$. Then the estimates
$$\hat{a}_2(l) = \frac{|\Omega_l|}{|\Sigma(1/2)|} |\tilde{v}|, \qquad \hat{b}_1(l) = \frac{1}{\pi} \bigl( \angle \tilde{v} - \angle \Sigma(1/2) \bigr), \quad l \in [k],$$
in NN-LIFT Algorithm 1 (see the definition of $\tilde{v}$ in (51)) satisfy, with probability at least $1 - \delta$,
$$|a_2(l) - \hat{a}_2(l)| \leq \frac{|\Omega_l|}{|\Sigma(1/2)|} O(\tilde{\epsilon}_2), \qquad |b_1(l) - \hat{b}_1(l)| \leq \frac{|\Omega_l|}{\pi |\Sigma(1/2)| |a_2(l)|} O(\tilde{\epsilon}_2).$$
Proof: The result is proved by arguing the concentration of the variable $\tilde{v}$ in (51) around its mean, characterized in (52). We use the Bernstein inequality to do this. Let $\tilde{v} := \sum_{i \in [n]} \tilde{v}_i$, where $\tilde{v}_i = \frac{1}{n} \frac{\tilde{y}_i}{p(x_i)} e^{-j \langle \omega_i, x_i \rangle}$. By the lower bound $p(x) \geq \psi$ assumed in Theorem 3, and the labels $\tilde{y}_i$ being bounded, the magnitudes of the centered $\tilde{v}_i$'s, i.e., $\tilde{v}_i - \mathbb{E}[\tilde{v}_i]$, are bounded by $O(\frac{1}{\psi n})$. The variance term is also bounded as
$$\sigma^2 = \Bigl\| \sum_{i \in [n]} \mathbb{E}\bigl[ (\tilde{v}_i - \mathbb{E}[\tilde{v}_i]) \, \overline{(\tilde{v}_i - \mathbb{E}[\tilde{v}_i])} \bigr] \Bigr\|,$$
where $\bar{u}$ denotes the complex conjugate of the complex number $u$. This is bounded as
$$\sigma^2 \leq \sum_{i \in [n]} \mathbb{E}\bigl[ \tilde{v}_i \bar{\tilde{v}}_i \bigr] = \frac{1}{n^2} \sum_{i \in [n]} \mathbb{E}\Bigl[ \frac{\tilde{y}_i^2}{p(x_i)^2} \Bigr].$$
Since the output $\tilde{y}$ is a binary label ($\tilde{y} \in \{0, 1\}$), we have $\mathbb{E}[\tilde{y}^2 | x] = \mathbb{E}[\tilde{y} | x] = \tilde{f}(x)$, and thus
$$\mathbb{E}\Bigl[ \frac{\tilde{y}^2}{p(x)^2} \Bigr] = \mathbb{E}\Bigl[ \mathbb{E}\Bigl[ \frac{\tilde{y}^2}{p(x)^2} \Bigm| x \Bigr] \Bigr] = \mathbb{E}\Biggl[ \frac{\tilde{f}(x)}{p(x)^2} \Biggr] \leq \frac{1}{\psi} \int_{\mathbb{R}^d} \tilde{f}(x) \, dx = \frac{\tilde{\zeta}_{\tilde{f}}}{\psi},$$
where the inequality uses the bound $p(x) \geq \psi$, and the last equality follows from the definition of $\tilde{\zeta}_{\tilde{f}}$. This gives us the variance bound $\sigma^2 \leq \frac{\tilde{\zeta}_{\tilde{f}}}{\psi n}$. Applying the Bernstein inequality concludes the concentration bound: with probability at least $1 - \delta$, we have
$$|\tilde{v} - \mathbb{E}[\tilde{v}]| \leq O\Biggl( \frac{1}{\psi n} \log \frac{1}{\delta} + \sqrt{\frac{\tilde{\zeta}_{\tilde{f}}}{\psi n} \log \frac{1}{\delta}} \Biggr) \leq O(\tilde{\epsilon}_2),$$
where the last inequality follows from the sample complexity bound. This implies that $\bigl| |\tilde{v}| - |\mathbb{E}[\tilde{v}]| \bigr| \leq O(\tilde{\epsilon}_2)$.
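As a toy illustration of why the magnitude and phase of the empirical mean recover $a_2(l)$ and $b_1(l)$, consider complex samples whose mean is $a_2 e^{j\pi b_1}$. All numbers below are hypothetical, and the $|\Omega_l|$ and $\Sigma(1/2)$ normalizations of the lemma are dropped:

```python
import numpy as np

rng = np.random.default_rng(5)
a2, b1 = 0.8, 0.3                 # hypothetical magnitude and phase parameters
n = 200_000

# toy model: each sample equals a2 * exp(j*pi*b1) plus zero-mean complex noise;
# the empirical mean concentrates at rate O(1/sqrt(n)), as in the Bernstein bound
v = a2 * np.exp(1j * np.pi * b1) + (rng.standard_normal(n) + 1j * rng.standard_normal(n))
v_bar = v.mean()

a2_hat = np.abs(v_bar)            # magnitude estimates a2
b1_hat = np.angle(v_bar) / np.pi  # phase (divided by pi) estimates b1

assert abs(a2_hat - a2) < 0.02
assert abs(b1_hat - b1) < 0.02
```

The phase error scales like the additive error divided by $|\mathbb{E}[\tilde{v}]|$, matching the $1/|a_2(l)|$ factor in the bound on $|\hat{b}_1(l) - b_1(l)|$.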
Substituting $|\mathbb{E}[\tilde{v}]|$ from (52) and considering the estimate $\hat{a}_2(l) = \frac{|\Omega_l|}{|\Sigma(1/2)|} |\tilde{v}|$, we have
$$|\hat{a}_2(l) - a_2(l)| \leq \frac{|\Omega_l|}{|\Sigma(1/2)|} O(\tilde{\epsilon}_2),$$
which finishes the first part of the proof. For the phase, we have
$$\phi := \angle \tilde{v} - \angle \mathbb{E}[\tilde{v}] = \pi \bigl( \hat{b}_1(l) - b_1(l) \bigr).$$
On the other hand, for small enough error $\tilde{\epsilon}_2$ (and thus small $\phi$), we have the approximation $\phi \sim \tan(\phi) \sim \frac{|\tilde{v} - \mathbb{E}[\tilde{v}]|}{|\mathbb{E}[\tilde{v}]|}$ (note that this is actually an upper bound, since $\phi \leq \tan(\phi)$). Thus,
$$|\hat{b}_1(l) - b_1(l)| \leq \frac{1}{\pi |\mathbb{E}[\tilde{v}]|} O(\tilde{\epsilon}_2) \leq \frac{|\Omega_l|}{\pi |\Sigma(1/2)| |a_2(l)|} O(\tilde{\epsilon}_2).$$
This finishes the proof of the second bound. □

C.3  Ridge regression analysis and guarantees

Let $h := \sigma(A_1^\top x + b_1)$ denote the neuron or hidden layer variable. With a slight abuse of notation, in the rest of the analysis in this section we append the variable $h$ with the dummy variable 1 to represent the bias, so that $h \in \mathbb{R}^{k+1}$. We write the output as $\tilde{y} = h^\top \beta + \eta$, where $\beta := [a_2, b_2] \in \mathbb{R}^{k+1}$. Given the estimated parameters of the first layer, denoted by $\hat{A}_1$ and $\hat{b}_1$, the neurons are estimated as $\hat{h} := \sigma(\hat{A}_1^\top x + \hat{b}_1)$. In addition, the dummy variable 1 is also appended, so that $\hat{h} \in \mathbb{R}^{k+1}$. Because of this estimated encoding of the neurons, we expand the output $\tilde{y}$ as
$$\tilde{y} = \hat{h}^\top \beta + \underbrace{(h^\top - \hat{h}^\top) \beta}_{\text{bias (approximation): } b(\hat{h})} + \underbrace{\eta}_{\text{noise}} = \tilde{f}(\hat{h}) + \eta, \qquad (56)$$
where $\tilde{f}(\hat{h}) := \mathbb{E}[\tilde{y} | \hat{h}] = \hat{h}^\top \beta + b(\hat{h})$. Here, we have a noisy linear model with an additional bias (approximation) term. Let $\hat{\beta}_\lambda$ denote the ridge regression estimator for some regularization parameter $\lambda \geq 0$, defined as the minimizer of the regularized empirical mean squared error, i.e.,
$$\hat{\beta}_\lambda := \arg\min_{\beta} \frac{1}{n} \sum_{i \in [n]} \bigl( \langle \beta, \hat{h}_i \rangle - \tilde{y}_i \bigr)^2 + \lambda \|\beta\|^2.$$
We know this estimator is given by (when $\hat{\Sigma}_{\hat{h}} + \lambda I \succ 0$)
$$\hat{\beta}_\lambda = \bigl( \hat{\Sigma}_{\hat{h}} + \lambda I \bigr)^{-1} \cdot \hat{\mathbb{E}}[\hat{h} \tilde{y}],$$
where $\hat{\Sigma}_{\hat{h}} := \frac{1}{n} \sum_{i \in [n]} \hat{h}_i \hat{h}_i^\top$ is the empirical covariance of $\hat{h}$, and $\hat{\mathbb{E}}$ denotes the empirical mean operator. The analysis of ridge regression leads to the following expected prediction error (risk) bound on the estimation of the output.

Lemma 12 (Expected prediction error of ridge regression). Suppose the parameter recovery results in Lemmata 9 and 11 on $A_1$ and $b_1$ hold. In addition, assume the nonlinear activating function $\sigma(\cdot)$ satisfies the Lipschitz property $|\sigma(u) - \sigma(u')| \leq L \cdot |u - u'|$ for $u, u' \in \mathbb{R}$, and suppose the noise, approximation, and statistical leverage conditions stated below also hold. Then, by choosing the optimal $\lambda > 0$ in the $\lambda$-regularized ridge regression (which estimates the parameters $\hat{a}_2$ and $\hat{b}_2$), the estimated output $\hat{f}(x) = \hat{a}_2^\top \sigma(\hat{A}_1^\top x + \hat{b}_1) + \hat{b}_2$ satisfies the risk bound
$$\mathbb{E}[|\hat{f}(x) - \tilde{f}(x)|^2] \leq O\Bigl( \frac{k \|\beta\|^2}{n} \Bigr) + O\Biggl( \sqrt{\frac{2 k \|\beta\|^2}{n} \Bigl( \mathbb{E}[b(\hat{h})^2] + \sigma_{\text{noise}}^2 / 2 \Bigr)} \Biggr) + \mathbb{E}[b(\hat{h})^2],$$
where
$$\mathbb{E}[b(\hat{h})^2] \leq \Bigl( r + \frac{|\Omega_l|}{\pi |\Sigma(1/2)| |a_2(l)|} \Bigr)^2 \|\beta\|^2 L^2 k \, O(\tilde{\epsilon}^2),$$
with $\tilde{\epsilon} := \max\{\tilde{\epsilon}_1, \tilde{\epsilon}_2\}$.

Proof: Since $\hat{h} := \sigma(\hat{A}_1^\top x + \hat{b}_1)$, we equivalently argue the bound on $\mathbb{E}[(\hat{h}^\top \hat{\beta}_\lambda - \tilde{f}(\hat{h}))^2]$, where $\hat{f}(x) = \hat{f}(\hat{h}) = \hat{h}^\top \hat{\beta}_\lambda$. From standard results in the study of inverse problems, we know (see Proposition 5 in Hsu et al. (2014))
$$\mathbb{E}[(\hat{h}^\top \hat{\beta}_\lambda - \tilde{f}(\hat{h}))^2] = \mathbb{E}[(\hat{h}^\top \beta - \tilde{f}(\hat{h}))^2] + \|\hat{\beta}_\lambda - \beta\|_{\Sigma_{\hat{h}}}^2.$$
Here, for a positive definite matrix $\Sigma \succ 0$, the vector norm $\|\cdot\|_\Sigma$ is defined as $\|v\|_\Sigma := \sqrt{v^\top \Sigma v}$. For the first term, by the definition $\tilde{f}(\hat{h}) := \mathbb{E}[\tilde{y} | \hat{h}] = \hat{h}^\top \beta + b(\hat{h})$, we have
$$\mathbb{E}[(\hat{h}^\top \beta - \tilde{f}(\hat{h}))^2] = \mathbb{E}[b(\hat{h})^2].$$
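The closed-form ridge estimator above is straightforward to verify numerically. A minimal sketch (hypothetical data, numpy): in the noiseless, well-specified case, the estimator with $\lambda \to 0$ recovers $\beta$:

```python
import numpy as np

def ridge(H, y, lam):
    """Closed-form ridge estimator: beta_lam = (Sigma_hat + lam*I)^{-1} E_hat[h*y],
    with Sigma_hat = (1/n) H^T H and E_hat[h*y] = (1/n) H^T y."""
    n, p = H.shape
    Sigma = H.T @ H / n
    return np.linalg.solve(Sigma + lam * np.eye(p), H.T @ y / n)

np.random.seed(4)
n, k = 200, 3
H = np.random.randn(n, k)            # stand-in for the estimated neurons h_hat
beta = np.array([1.0, -2.0, 0.5])    # hypothetical last-layer parameters [a2, b2]
y = H @ beta                         # noiseless outputs, no approximation bias

beta_hat = ridge(H, y, 1e-10)
assert np.allclose(beta_hat, beta, atol=1e-4)
```

In the setting of the analysis, $H$ would carry an appended column of ones (the dummy bias coordinate), and both the noise $\eta$ and the approximation term $b(\hat{h})$ make the optimal $\lambda$ strictly positive, as quantified by Lemma 14 and Remark 7.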
Lemma 13 bounds $\mathbb{E}[b(\hat h)^2]$, and the bound on $\|\hat\beta_\lambda - \beta\|^2_{\Sigma_{\hat h}}$ is argued in Lemma 14 and Remark 7. Combining these bounds finishes the proof. $\Box$

In order to have the final risk bounded as $\mathbb{E}[|\hat f(x) - \tilde f(x)|^2] \le \tilde O(\epsilon^2)$ for some $\epsilon > 0$, the above lemma imposes the sample complexity (some of the other parameters considered in (32) and (55) are not repeated here)
$$n \ge \tilde O\Bigl(\frac{L k \|\beta\|^2}{\epsilon^2}\bigl(1 + \sigma_{\rm noise}^2\bigr)\Bigr). \qquad (57)$$

Lemma 13 (Bounded approximation). Suppose the parameter recovery results in Lemmata 9 and 11 on $A_1$ and $b_1$ hold. In addition, assume the nonlinear activating function $\sigma(\cdot)$ satisfies the Lipschitz property $|\sigma(u) - \sigma(u')| \le L \cdot |u - u'|$ for $u, u' \in \mathbb{R}$. Then, the approximation term is bounded as
$$\mathbb{E}[b(\hat h)^2] \le \Bigl(r + \frac{|\Omega_l|}{\pi\, |\Sigma(1/2)|\,|a_2(l)|}\Bigr)^2 \|\beta\|^2 L^2 k\, O(\tilde\epsilon^2).$$

Proof: We have
$$\mathbb{E}[b(\hat h)^2] = \mathbb{E}\bigl[\langle h - \hat h, \beta\rangle^2\bigr] \le \|\beta\|^2 \cdot \mathbb{E}\bigl[\|h - \hat h\|^2\bigr]. \qquad (58)$$
Define $\tilde\epsilon := \max\{\tilde\epsilon_1, \tilde\epsilon_2\}$, where $\tilde\epsilon_1$ and $\tilde\epsilon_2$ are the corresponding bounds in Lemmata 9 and 11, respectively. Using the Lipschitz property of the nonlinear function $\sigma(\cdot)$, we have
$$|h_l - \hat h_l| = \bigl|\sigma\bigl(\langle (A_1)_l, x\rangle + b_1(l)\bigr) - \sigma\bigl(\langle (\hat A_1)_l, x\rangle + \hat b_1(l)\bigr)\bigr| \le L \cdot \Bigl[ \bigl|\langle (A_1)_l - (\hat A_1)_l,\, x\rangle\bigr| + \bigl|b_1(l) - \hat b_1(l)\bigr| \Bigr] \le L \cdot \Bigl( r\, O(\tilde\epsilon) + \frac{|\Omega_l|}{\pi\, |\Sigma(1/2)|\,|a_2(l)|}\, O(\tilde\epsilon) \Bigr),$$
where in the second inequality we use the bounds in Lemmata 9 and 11 together with the boundedness of $x$, i.e., $\|x\| \le r$. Applying this to (58) concludes the proof. $\Box$

We now assume the following additional conditions to bound $\|\hat\beta_\lambda - \beta\|^2_{\Sigma_{\hat h}}$. The following discussion is along the lines of the results of Hsu et al. (2014). We define the effective dimensions of the covariate $\hat h$ as
$$k_{p,\lambda} := \sum_{j \in [k]} \Bigl(\frac{\lambda_j}{\lambda_j + \lambda}\Bigr)^p, \qquad p \in \{1, 2\},$$
where the $\lambda_j$'s denote the (positive) eigenvalues of $\Sigma_{\hat h}$, and $\lambda$ is the regularization parameter of the ridge regression.
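The effective dimensions $k_{p,\lambda}$ above are simple spectral sums; as a small illustrative sketch (helper names are ours, not the paper's):

```python
import numpy as np

def effective_dimension(eigvals, lam, p):
    """k_{p,lambda} = sum_j (lambda_j / (lambda_j + lam))**p, for p in {1, 2}.

    eigvals : positive eigenvalues lambda_j of the covariance Sigma_hat.
    lam     : ridge regularization parameter lambda >= 0.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum((eigvals / (eigvals + lam)) ** p))

# a larger lambda shrinks the effective dimension from k toward 0
eigs = np.array([4.0, 1.0, 0.25])
print(effective_dimension(eigs, lam=0.0, p=1))  # → 3.0  (lambda = 0 recovers k)
```

Note that $k_{p,\lambda}$ is monotonically decreasing in $\lambda$, which is how regularization reduces the variance term in Lemma 14.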
• Subgaussian noise: there exists a finite $\sigma_{\rm noise} \ge 0$ such that, almost surely,
$$\mathbb{E}_\eta\bigl[\exp(\alpha\eta) \mid \hat h\bigr] \le \exp\bigl(\alpha^2 \sigma_{\rm noise}^2 / 2\bigr), \qquad \forall \alpha \in \mathbb{R},$$
where $\eta$ denotes the noise in the output $\tilde y$.

• Bounded statistical leverage: there exists a finite $\rho_\lambda \ge 1$ such that, almost surely,
$$\sqrt{\frac{k}{\bigl(\inf_j\{\lambda_j\} + \lambda\bigr)\, k_{1,\lambda}}} \le \rho_\lambda.$$

• Bounded approximation error at $\lambda$: there exists a finite $B_{{\rm bias},\lambda} \ge 0$ such that, almost surely,
$$\rho_\lambda \bigl( B_{\max} + \sqrt{k}\, \|\beta\| \bigr) \le B_{{\rm bias},\lambda},$$
where $|b(\hat h)| \le B_{\max}$. Note that the approximation term $b(\hat h)$ is bounded in Lemma 13. The parameter $B_{{\rm bias},\lambda}$ only contributes to the lower-order terms in the analysis of ridge regression.

Lemma 14 (Bounding excess mean squared error; Theorem 2 of Hsu et al. (2014)). Fix some $\lambda \ge 0$, and suppose the above noise, approximation, and statistical leverage conditions hold, and in addition,
$$n \ge \tilde O\bigl(\rho_\lambda^2\, k_{1,\lambda}\bigr). \qquad (59)$$
Then, we have
$$\|\hat\beta_\lambda - \beta\|^2_{\Sigma_{\hat h}} \le \tilde O\Bigl(\frac{k}{\lambda n}\bigl(\mathbb{E}[b(\hat h)^2] + \sigma_{\rm noise}^2/2\bigr) + \lambda\|\beta\|_2^2\Bigl(1 + \frac{k/\lambda + 1}{n}\Bigr)\Bigr) + o(1/n),$$
where $\mathbb{E}[b(\hat h)^2]$ is bounded in Lemma 13.

In the above lemma, we also used the discussions in Remarks 12 and 15 of Hsu et al. (2014), which include comments on the simplification of the general result.

Remark 7 (Optimal $\lambda$). In addition, along the discussion in Remark 15 of Hsu et al. (2014), by choosing the optimal $\lambda > 0$ that minimizes the bound in the above lemma, we have
$$\|\hat\beta_\lambda - \beta\|^2_{\Sigma_{\hat h}} \le O\Bigl(\frac{k\|\beta\|^2}{n}\Bigr) + O\Biggl(\sqrt{\frac{2k\|\beta\|^2}{n}\bigl(\mathbb{E}[b(\hat h)^2] + \sigma_{\rm noise}^2/2\bigr)}\Biggr).$$

D Proof of Theorem 5

Before we provide the proof, we first state the details of the bound on $C_f$. We require
$$C_f \le \min\Biggl\{ \tilde O\Bigl(\frac{1}{r}\Bigl(\frac{1}{\sqrt k} + \delta_1\Bigr)^{-1} \frac{1}{\sqrt{\mathbb{E}[\|\mathcal S_3(x)\|^2]}} \cdot \frac{\tilde\lambda_{\min}^2}{\tilde\lambda_{\max}^2} \cdot \lambda_{\min} \cdot \frac{s_{\min}^3(A_1)}{s_{\max}(A_1)} \cdot \tilde\epsilon_1 \Bigr), \qquad (60)$$
$$\tilde O\Bigl(\frac{1}{r}\Bigl(\frac{1}{\sqrt k} + \delta_1\Bigr)^{-1} \frac{1}{\sqrt{\mathbb{E}[\|\mathcal S_3(x)\|^2]}} \cdot \Bigl(\frac{\lambda_{\min}}{\tilde\lambda_{\min}\tilde\lambda_{\max}}\Bigr)^{1.5} s_{\min}^3(A_1) \cdot \frac{1}{\sqrt k}\Bigr),$$
$$O\Bigl(\frac{1}{r}\Bigl(\frac{1}{\sqrt k} + \delta_1\Bigr)^{-1} \frac{\mathbb{E}[\|\mathcal S_3(x)\|^2]^{1/4}}{\mathbb{E}[\|\mathcal S_2(x)\|^2]^{3/4}} \cdot \tilde\lambda_{\min} \cdot s_{\min}^{1.5}(A_1)\Bigr) \Biggr\}.$$

Proof of Theorem 5: We first argue that the perturbation involves both estimation and approximation parts.

Perturbation decomposition into approximation and estimation parts: Similar to the estimation-part analysis, we need to ensure the perturbation from the exact means is small enough to apply the analysis of Lemmas 9 and 11. Here, in addition to the empirical estimation of quantities (estimation error), the approximation error also contributes to the perturbation. This is because there is no realizable setting here, and the observations are from an arbitrary function $f(x)$. We address this for both the tensor decomposition and the Fourier parts as follows.

Recall that we use the notation $\tilde f(x)$ (and $\tilde y$) to denote the output of a neural network. For an arbitrary function $f(x)$, we refer to the neural network satisfying the approximation error provided in Theorem 4 by $\tilde y_f$. The ultimate goal of our analysis is to show that NN-LIFT recovers the parameters of this specific neural network with small error. More precisely, note that there is a class of neural networks satisfying the approximation bound in Theorem 5, and it suffices to show that the output of the algorithm is close enough to one of them.

Tensor decomposition: There are two perturbation sources in the tensor analysis: one from the approximation part and the other from the estimation part. By Lemma 6, the CP decomposition form is given by $\tilde T_f = \mathbb{E}[\tilde y_f \cdot \mathcal S_3(x)]$, and thus the perturbation tensor is written as
$$E := \tilde T_f - \hat T = \mathbb{E}\bigl[\tilde y_f \cdot \mathcal S_3(x)\bigr] - \frac{1}{n}\sum_{i\in[n]} y_i \cdot \mathcal S_3(x_i),$$
where $\hat T = \frac{1}{n}\sum_{i\in[n]} y_i \cdot \mathcal S_3(x_i)$ is the empirical form used in NN-LIFT Algorithm 1.
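As an aside, the empirical form $\hat T = \frac{1}{n}\sum_i y_i \cdot \mathcal S_3(x_i)$ is a simple weighted average of score-function tensors; a minimal sketch (the score tensors are assumed precomputed, since how $\mathcal S_3$ is obtained depends on the input density):

```python
import numpy as np

def empirical_third_moment(y, S3):
    """T_hat = (1/n) * sum_i y_i * S3(x_i): the empirical cross-moment tensor
    whose population version E[y * S3(x)] carries the CP components (Lemma 6).

    y  : (n,) observed outputs y_i = f(x_i).
    S3 : (n, d, d, d) array of the third-order score tensors S3(x_i).
    """
    return np.einsum('i,iabc->abc', y, S3) / len(y)

# toy usage with a rank-1 "score": T_hat collapses to mean(y) * (e1 ⊗ e1 ⊗ e1)
d, n = 3, 5
S3 = np.zeros((n, d, d, d))
S3[:, 0, 0, 0] = 1.0
y = np.arange(1.0, 6.0)            # sample outputs with mean 3.0
T_hat = empirical_third_moment(y, S3)
print(T_hat[0, 0, 0])              # → 3.0
```

The `einsum` contraction makes the averaging over samples explicit and is the kind of embarrassingly parallel multi-linear operation the method relies on.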
Note that the observations are from the arbitrary function $y = f(x)$. The perturbation tensor can be expanded as
$$E = \underbrace{\mathbb{E}[\tilde y_f \cdot \mathcal S_3(x)] - \mathbb{E}[y \cdot \mathcal S_3(x)]}_{:=\, E_{\rm apx.}} + \underbrace{\mathbb{E}[y \cdot \mathcal S_3(x)] - \frac{1}{n}\sum_{i\in[n]} y_i \cdot \mathcal S_3(x_i)}_{:=\, E_{\rm est.}},$$
where $E_{\rm apx.}$ and $E_{\rm est.}$ respectively denote the perturbations from the approximation and estimation parts. We would also like to use the exact second-order moment $\tilde M_{2,f} = \mathbb{E}[\tilde y_f \cdot \mathcal S_2(x)]$ for the whitening Procedure 5 in the tensor decomposition method. But we only have an empirical version, for which the perturbation matrix $E_2 := \tilde M_{2,f} - \hat M_2$ is expanded as
$$E_2 = \underbrace{\mathbb{E}[\tilde y_f \cdot \mathcal S_2(x)] - \mathbb{E}[y \cdot \mathcal S_2(x)]}_{:=\, E_{2,\rm apx.}} + \underbrace{\mathbb{E}[y \cdot \mathcal S_2(x)] - \frac{1}{n}\sum_{i\in[n]} y_i \cdot \mathcal S_2(x_i)}_{:=\, E_{2,\rm est.}},$$
where $E_{2,\rm apx.}$ and $E_{2,\rm est.}$ respectively denote the perturbations from the approximation and estimation parts. In Theorem 3, where there is no approximation error, we only need to analyze the estimation perturbations characterized in (33) and (34), since the neural network output is directly observed (and thus we use $\tilde y$ to denote the output). Now the goal is to argue that the norms of the perturbations $E$ and $E_2$ are small enough (see Lemma 10), ensuring that the tensor power iteration recovers the rank-1 components of $\tilde T_f = \mathbb{E}[\tilde y_f \cdot \mathcal S_3(x)]$ with bounded error. Recall again from Lemma 6 that the rank-1 components of the tensor $\tilde T_f = \mathbb{E}[\tilde y_f \cdot \mathcal S_3(x)]$ are the desired components to recover.

The estimation perturbations $E_{\rm est.}$ and $E_{2,\rm est.}$ are bounded similarly to Lemma 9 (see (35) and (36)), and thus we have, w.h.p.,
$$\|E_{\rm est.}\| \le \tilde O\Bigl(\frac{y_{\max}}{\sqrt n}\sqrt{\mathbb{E}\bigl[\|M_3(x) M_3^\top(x)\|\bigr]}\Bigr), \qquad \|E_{2,\rm est.}\| \le \tilde O\Bigl(\frac{y_{\max}}{\sqrt n}\sqrt{\mathbb{E}\bigl[\|\mathcal S_2(x) \mathcal S_2^\top(x)\|\bigr]}\Bigr),$$
where $M_3(x) \in \mathbb{R}^{d\times d^2}$ denotes the matricization of the score function tensor $\mathcal S_3(x) \in \mathbb{R}^{d\times d\times d}$, and $y_{\max}$ is the bound on $|f(x)| = |y|$.

The norm of the approximation perturbation $E_{\rm apx.} := \mathbb{E}[(\tilde y_f - y)\cdot \mathcal S_3(x)]$ is bounded as
$$\|E_{\rm apx.}\| = \bigl\|\mathbb{E}[(\tilde y_f - y)\cdot \mathcal S_3(x)]\bigr\| \le \mathbb{E}\bigl[\|(\tilde y_f - y)\cdot \mathcal S_3(x)\|\bigr] = \mathbb{E}\bigl[|\tilde y_f - y|\cdot\|\mathcal S_3(x)\|\bigr] \le \Bigl(\mathbb{E}\bigl[|\tilde y_f - y|^2\bigr]\cdot\mathbb{E}\bigl[\|\mathcal S_3(x)\|^2\bigr]\Bigr)^{1/2},$$
where the first inequality follows from Jensen's inequality applied to the convex norm function, and the last inequality from Cauchy-Schwarz. Applying the approximation bound in Theorem 4, we have
$$\|E_{\rm apx.}\| \le O(r C_f)\cdot\Bigl(\frac{1}{\sqrt k}+\delta_1\Bigr)\cdot\sqrt{\mathbb{E}[\|\mathcal S_3(x)\|^2]}, \qquad (61)$$
and similarly,
$$\|E_{2,\rm apx.}\| \le O(r C_f)\cdot\Bigl(\frac{1}{\sqrt k}+\delta_1\Bigr)\cdot\sqrt{\mathbb{E}[\|\mathcal S_2(x)\|^2]}.$$
We now need to ensure that the overall perturbations $E = E_{\rm est.} + E_{\rm apx.}$ and $E_2 = E_{2,\rm est.} + E_{2,\rm apx.}$ satisfy the required bounds in Lemma 10. Note that, similarly to what we do in Lemma 9, we first impose a bound such that the term involving $\|E\|$ is dominant in (46). Bounding the estimation part $\|E_{\rm est.}\|$ provides a sample complexity similar to that of Lemma 9, with $\tilde y_{\max}$ substituted by $y_{\max}$. For the approximation error, by imposing (the third bound stated in the theorem)
$$C_f \le O\Bigl(\frac{1}{r}\Bigl(\frac{1}{\sqrt k}+\delta_1\Bigr)^{-1}\frac{\mathbb{E}[\|\mathcal S_3(x)\|^2]^{1/4}}{\mathbb{E}[\|\mathcal S_2(x)\|^2]^{3/4}}\cdot\tilde\lambda_{\min}\cdot s_{\min}^{1.5}(A_1)\Bigr),$$
we ensure that the term involving $\|E\|$ is dominant in the final recovery error in (46). By doing this, we no longer need to impose a bound on $\|E_{2,\rm apx.}\|$, and applying the bound in (61) to the required bound on $\|E\|$ in Lemma 10 leads to (the second bound stated in the theorem)
$$C_f \le \tilde O\Bigl(\frac{1}{r}\Bigl(\frac{1}{\sqrt k}+\delta_1\Bigr)^{-1}\frac{1}{\sqrt{\mathbb{E}[\|\mathcal S_3(x)\|^2]}}\cdot\Bigl(\frac{\lambda_{\min}}{\tilde\lambda_{\min}\tilde\lambda_{\max}}\Bigr)^{1.5} s_{\min}^3(A_1)\cdot\frac{1}{\sqrt k}\Bigr).$$
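The Jensen/Cauchy-Schwarz chain used above to bound $\|E_{\rm apx.}\|$ also holds sample-wise, which makes it easy to check numerically. A quick illustration with synthetic stand-ins for $(\tilde y_f - y)$ and a flattened $\mathcal S_3(x)$ (purely illustrative; not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10000, 4
delta = rng.normal(size=n)          # stand-in for the residuals (y_tilde_f - y)
S = rng.normal(size=(n, d))         # stand-in for the (flattened) score S3(x_i)

# ||mean(delta_i * S_i)||  <=  sqrt( mean(delta^2) * mean(||S||^2) )
lhs = np.linalg.norm(np.mean(delta[:, None] * S, axis=0))
rhs = np.sqrt(np.mean(delta ** 2) * np.mean(np.sum(S ** 2, axis=1)))
print(lhs <= rhs)  # → True
```

The inequality is deterministic for any sample: the triangle inequality (Jensen for the empirical measure) followed by Cauchy-Schwarz over the $n$ terms.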
Finally, applying the result of Lemma 10, we have the column-wise error guarantees (up to permutation)
$$\|(A_1)_j - (\hat A_1)_j\| \le \tilde O\Bigl(\frac{s_{\max}(A_1)}{\lambda_{\min}}\cdot\frac{\tilde\lambda_{\max}^2}{\sqrt{\tilde\lambda_{\min}}}\cdot\frac{\|E_{\rm est.}\| + \|E_{\rm apx.}\|}{\tilde\lambda_{\min}^{1.5}\cdot s_{\min}^3(A_1)}\Bigr) \le \tilde O\Biggl(\frac{\tilde\lambda_{\max}^2}{\tilde\lambda_{\min}^2}\cdot\frac{s_{\max}(A_1)}{\lambda_{\min}\cdot s_{\min}^3(A_1)}\Bigl(\frac{y_{\max}}{\sqrt n}\sqrt{\mathbb{E}\bigl[\|M_3(x)M_3^\top(x)\|\bigr]} + r C_f\cdot\Bigl(\frac{1}{\sqrt k}+\delta_1\Bigr)\cdot\sqrt{\mathbb{E}[\|\mathcal S_3(x)\|^2]}\Bigr)\Biggr) \le \tilde O(\tilde\epsilon_1),$$
where in the second inequality we substituted the earlier bounds on $\|E_{\rm est.}\|$ and $\|E_{\rm apx.}\|$, and the first bounds on $n$ and $C_f$ stated in the theorem are used in the last inequality.

Fourier part: Let
$$\tilde v_f := \frac{1}{n}\sum_{i\in[n]}\frac{(\tilde y_f)_i}{p(x_i)}\, e^{-j\langle\omega_i,\, x_i\rangle}.$$
Note that this is a realization of $\tilde v$ defined in (51) when the output is generated by the neural network satisfying the approximation error provided in Theorem 4, denoted by $\tilde y_f$; see the discussion at the beginning of the proof. The perturbation is now
$$e := \mathbb{E}[\tilde v_f] - \underbrace{\frac{1}{n}\sum_{i\in[n]}\frac{y_i}{p(x_i)}\, e^{-j\langle\omega_i,\, x_i\rangle}}_{=:\, v}.$$
Similar to the tensor decomposition part, it can be expanded into estimation and approximation parts as
$$e := \underbrace{\mathbb{E}[\tilde v_f] - \mathbb{E}[v]}_{e_{\rm apx.}} + \underbrace{\mathbb{E}[v] - v}_{e_{\rm est.}}.$$
Similar to Lemma 11, the estimation error is w.h.p. bounded as $|e_{\rm est.}| \le O(\tilde\epsilon_2)$ if the sample complexity satisfies
$$n \ge \tilde O\Bigl(\frac{\zeta_f}{\psi\,\tilde\epsilon_2^2}\Bigr),$$
where $\zeta_f := \int_{\mathbb{R}^d} f(x)^2\, dx$. Notice the difference between $\zeta_f$ and $\tilde\zeta_{\tilde f}$. The approximation part is also bounded as
$$|e_{\rm apx.}| \le \frac{1}{\psi}\,\mathbb{E}\bigl[|\tilde y_f - y|\bigr] \le \frac{1}{\psi}\sqrt{\mathbb{E}\bigl[|\tilde y_f - y|^2\bigr]} \le \frac{1}{\psi}\, O(r C_f)\cdot\Bigl(\frac{1}{\sqrt k}+\delta_1\Bigr),$$
where the last inequality follows from the approximation bound in Theorem 4. Imposing the condition
$$C_f \le \frac{1}{r}\Bigl(\frac{1}{\sqrt k}+\delta_1\Bigr)^{-1}\cdot O(\psi\,\tilde\epsilon_2) \qquad (62)$$
satisfies the desired bound $|e_{\rm apx.}| \le O(\tilde\epsilon_2)$. The rest of the analysis is the same as Lemma 11.
Ridge regression: This introduces an additional approximation term in the linear regression formulated in (56). Given the above bounds on $C_f$, the new approximation term only contributes to lower-order terms. Combining the analyses for the tensor decomposition, Fourier, and ridge regression parts finishes the proof. $\Box$

D.1 Discussion on Corollary 1

Similar to the specific Gaussian kernel function, we can also provide results for other kernel functions, and in general for positive definite functions, as follows. $f(x)$ is said to be positive definite if $\sum_{j,l} x_j x_l f(x_j - x_l) \ge 0$ for all $x_j, x_l \in \mathbb{R}^d$. Barron (1993) shows that positive definite functions have $C_f$ bounded as
$$C_f \le \sqrt{-f(0)\cdot\nabla^2 f(0)},$$
where $\nabla^2 f(x) := \sum_{i\in[d]}\partial^2 f(x)/\partial x_i^2$. Note that the operator $\nabla^2$ is different from the derivative operator $\nabla^{(2)}$ that we defined in (4). Applying this to the proposed bound in (18), we conclude that our algorithm can train a neural network which approximates a class of positive definite and kernel functions with bounds similar to those in Theorem 5. Corollary 1 is for the special case of the Gaussian kernel function.

Proof of Corollary 1: For the location and scale mixture $f(x) := \int K(\alpha(x+\beta))\, G(d\alpha, d\beta)$, we have (Barron, 1993)
$$C_f \le C_K\cdot\int|\alpha|\cdot|G|(d\alpha, d\beta),$$
where $C_K$ denotes the corresponding parameter for $K(x)$. For the standard Gaussian kernel function $K(x)$ considered here, we have (Barron, 1993) $C_K \le \sqrt d$, which concludes
$$C_f \le \sqrt d\cdot\int|\alpha|\cdot|G|(d\alpha, d\beta).$$
We now apply the required bound in (18) to finish the proof. In this specific setting, we also have the following simplifications for the bound in (18).
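One ingredient in these simplifications is the third-order score function of a Gaussian input, which can be formed explicitly. As an illustrative sketch (helper name is ours), note that in one dimension with $\sigma_x^2 = 1$ it reduces to the Hermite polynomial $x^3 - 3x$:

```python
import numpy as np

def gaussian_score3(x, sigma2=1.0):
    """Third-order score tensor for x ~ N(0, sigma2 * I_d):
    S3(x) = x^{(x)3}/sigma2^3
            - (1/sigma2^2) * sum_j (x(x)e_j(x)e_j + e_j(x)x(x)e_j + e_j(x)e_j(x)x),
    where (x) denotes the outer (tensor) product."""
    d = len(x)
    T = np.einsum('a,b,c->abc', x, x, x) / sigma2 ** 3
    I = np.eye(d)
    # the three symmetrized x-times-identity terms
    T -= (np.einsum('a,bc->abc', x, I) +
          np.einsum('b,ac->abc', x, I) +
          np.einsum('c,ab->abc', x, I)) / sigma2 ** 2
    return T

print(gaussian_score3(np.array([2.0]))[0, 0, 0])  # → 2.0, i.e. 2**3 - 3*2
```

The resulting tensor is symmetric in its three modes, as a score tensor must be.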
For the Gaussian input $x \sim \mathcal N(0, \sigma_x^2 I_d)$, the score function is
$$\mathcal S_3(x) = \frac{1}{\sigma_x^6}\, x^{\otimes 3} - \frac{1}{\sigma_x^4}\sum_{j\in[d]}\bigl(x\otimes e_j\otimes e_j + e_j\otimes x\otimes e_j + e_j\otimes e_j\otimes x\bigr),$$
which has expected square norm $\mathbb{E}[\|\mathcal S_3(x)\|^2] = \tilde O(d^3/\sigma_x^6)$. Given that the input is Gaussian and the activating function is the step function, we can also write the coefficients $\lambda_j$ and $\tilde\lambda_j$ as
$$\lambda_j = a_2(j)\cdot\frac{1}{\sqrt{2\pi}\,\sigma_x^3}\cdot\exp\Bigl(-\frac{b_1(j)^2}{2\sigma_x^2}\Bigr)\cdot\Bigl(\frac{b_1(j)^2}{\sigma_x^2}-1\Bigr), \qquad \tilde\lambda_j = a_2(j)\cdot\frac{b_1(j)}{\sqrt{2\pi}\,\sigma_x^3}\cdot\exp\Bigl(-\frac{b_1(j)^2}{2\sigma_x^2}\Bigr).$$
Given the bounds on the coefficients, $|b_1(j)| \le 1$ and $|a_2(j)| \le 2C_f$ for $j\in[k]$, we have
$$\frac{\tilde\lambda_{\min}}{\tilde\lambda_{\max}} \ge \frac{(a_2)_{\min}\cdot(b_1)_{\min}}{2C_f}\exp\bigl(-1/(2\sigma_x^2)\bigr), \qquad \lambda_{\min} \ge \frac{(a_2)_{\min}}{\sqrt{2\pi}\,\sigma_x^3}\exp\bigl(-1/(2\sigma_x^2)\bigr)\cdot\min_{j\in[k]}\bigl|b_1(j)^2/\sigma_x^2 - 1\bigr|.$$
Recall that the columns of $A_1$ are randomly drawn from the Fourier spectrum of $f(x)$ as described in (15). Given that $f(x)$ is the Gaussian kernel, the Fourier spectrum $\|\omega\|\cdot|F(\omega)|$ corresponds to a sub-Gaussian distribution. Thus, the singular values of $A_1$ are bounded as (Rudelson and Vershynin, 2009)
$$\frac{s_{\min}(A_1)}{s_{\max}(A_1)} \ge \frac{1-\sqrt{k/d}}{1+\sqrt{k/d}} \ge O(1),$$
where the last inequality follows from $k = Cd$ for some small enough $C < 1$. Substituting these bounds into the required bound in (18) finishes the proof. $\Box$

References

A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries. In Conference on Learning Theory (COLT), June 2014.

Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data generating distribution. arXiv preprint arXiv:1211.4246, 2012.

A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory (COLT), June 2013.

Anima Anandkumar, Rong Ge, and Majid Janzamin. Sample complexity analysis for learning overcomplete latent variable models through tensor methods. arXiv preprint arXiv:1408.0553, Aug. 2014a.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014b.

Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, Feb. 2014c.

Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In Proceedings of the Conference on Learning Theory (COLT), Paris, France, July 2015.

Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1908–1916, 2014.

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343, 2013.

Antonio Auffinger, Gerard Ben Arous, et al. Complexity of random smooth functions on the high-dimensional sphere. The Annals of Probability, 41(6):4214–4247, 2013.

Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, May 1993.

Andrew R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.
Peter Bartlett and Shai Ben-David. Hardness results for neural network approximation problems. In Computational Learning Theory, pages 50–62. Springer, 1999.

Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. arXiv preprint arXiv:1311.3651, 2013.

Avrim L. Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. In Machine Learning: From Theory to Applications, pages 9–28. Springer, 1993.

Martin L. Brady, Raghu Raghavan, and Joseph Slawny. Backpropagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665–674, 1989.

Venkat Chandrasekaran, Pablo Parrilo, Alan S. Willsky, et al. Latent variable graphical model selection via convex optimization. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1610–1613. IEEE, 2010.

Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surface of multilayer networks. In Proc. of 18th Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), 2015.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989a.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989b.

Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

P. Frasconi, M. Gori, and A. Tesi. Successes and failures of backpropagation: A theoretical investigation. Progress in Neural Networks: Architecture, 5:205, 1997.

Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.

Benjamin D. Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. CoRR, abs/1506.07540, 2015.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569–600, 2014.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, pages 695–709, 2005.

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Score function features for discriminative learning: Matrix and tensor frameworks. arXiv preprint arXiv:1412.2863, Dec. 2014.

Michael J. Kearns and Umesh Virkumar Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

Pravesh Kothari and Raghu Meka. Almost optimal pseudorandom generators for spherical caps. arXiv preprint arXiv:1411.6299, 2014.

Christian Kuhlmann. Hardness results for general two-layer neural networks. In Proc. of COLT, pages 275–285, 2000.

Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.

Robert J. Marks II and Payman Arabshahi. Fourier analysis and filtering of a single hidden layer perceptron. In International Conference on Artificial Neural Networks (IEEE/ENNS), Sorrento, Italy, 1994.

Nelson Morgan and Hervé Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NIPS, pages 630–637, 1989.

Raúl Rojas. Neural Networks: A Systematic Introduction. Springer Science & Business Media, 1996.

Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707–1739, 2009.

Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. NIPS workshop on Deep Learning and Representation Learning, Dec. 2014.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Jiří Šíma. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.

L. Song, A. Anandkumar, B. Dai, and B. Xie. Nonparametric estimation of multi-view latent variable models. arXiv preprint arXiv:1311.3287, Nov. 2013.

Bharath Sriperumbudur, Kenji Fukumizu, Revant Kumar, Arthur Gretton, and Aapo Hyvärinen. Density estimation in infinite dimensional exponential families. arXiv preprint arXiv:1312.3516, 2013.

Kevin Swersky, David Buchman, Nando D. Freitas, Benjamin M. Marlin, et al. On autoencoders and score matching for energy based models. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1201–1208, 2011.

Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Yining Wang, Hsiao-Yu Tung, Alexander Smola, and Animashree Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Proc. of NIPS, 2015.

T. Zhang and G. Golub. Rank-one approximation to high order tensors. SIAM Journal on Matrix Analysis and Applications, 23:534–550, 2001.

Yuchen Zhang, Jason D. Lee, and Michael I. Jordan. ℓ1-regularized neural networks are improperly learnable in polynomial time. CoRR, abs/1510.03528, 2015.
