No bad local minima: Data independent training error guarantees for multilayer neural networks

Daniel Soudry
Department of Statistics, Columbia University, New York, NY 10027, USA
daniel.soudry@gmail.com

Yair Carmon
Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
yairc@stanford.edu

Abstract

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing about the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.

1 Introduction

Multilayer Neural Networks (MNNs) have achieved state-of-the-art performance in many areas of machine learning [20]. This success is typically achieved by training complicated models with a simple stochastic gradient descent (SGD) method, or one of its variants. However, SGD is only guaranteed to converge to critical points, at which the gradient of the expected loss is zero [5], and, specifically, to stable local minima [25] (this is true also for regular gradient descent [22]). Since loss functions parametrized by MNN weights are non-convex, it has long been a mystery why SGD works well, rather than converging to "bad" local minima, where the training error is high (and thus the test error is also high).
Previous results (section 2) suggest that the training error at all local minima should be low if the MNNs have extremely wide layers. However, such wide MNNs would also have an extremely large number of parameters, and serious overfitting issues. Moreover, current state-of-the-art results are typically achieved by deep MNNs [13, 19], rather than wide ones. Therefore, we aim to provide training error guarantees at a more practical number of parameters.

As a common rule of thumb, a multilayer neural network should have at least as many parameters as training samples, and use regularization, such as dropout [15], to reduce overfitting. For example, AlexNet [19] had 60 million parameters and was trained using 1.2 million examples. This over-parametrization regime continues in more recent works, which achieve state-of-the-art performance with very deep networks [13]. These networks are typically under-fitting [13], which suggests that the training error is the main bottleneck in further improving performance.

In this work we focus on MNNs with a single output and leaky rectified linear units. We provide a guarantee that the training error is zero in every differentiable local minimum (DLM), under mild over-parametrization, and essentially for every dataset. With one hidden layer (Theorem 4) we show that the training error is zero in all DLMs whenever the number of weights in the first layer is larger than the number of samples $N$, i.e., when $N \le d_0 d_1$, where $d_l$ is the width of the $l$-th activation layer. For MNNs with $L \ge 3$ layers we show that, if $N \le d_{L-2} d_{L-1}$, then convergence to potentially bad DLMs (in which the training error is not zero) can be averted by applying a small perturbation to the MNN's weights and then fixing all the weights except the last two weight layers (Corollary 6).
A key aspect of our approach is the presence of a multiplicative dropout-like noise term in our MNN model. We formalize the notion of validity for essentially every dataset by showing that our results hold almost everywhere with respect to the Lebesgue measure over the data and this noise term. This approach is commonly used in smoothed analysis of algorithms, and often affords great improvements over worst-case guarantees (e.g., [30]). Intuitively, there may be some rare cases where our results do not hold, but almost any infinitesimal perturbation of the input and activation functions will fix this. Thus, our results assume essentially no structure on the input data, and are unique in that sense.

2 Related work

At first, it may seem hopeless to find any training error guarantee for MNNs. Since the loss of MNNs is highly non-convex, with multiple local minima [8], it seems reasonable that optimization with SGD would get stuck at some bad local minimum. Moreover, many theoretical hardness results (reviewed in [29]) have been proven for MNNs with one hidden layer. Despite these results, one can easily achieve zero training error [3, 24] if the MNN's last hidden layer has more units than training samples ($d_l \ge N$). This case is not very useful, since it results in a huge number of weights (larger than $d_{l-1} N$), leading to strong over-fitting. However, such wide networks are easy to optimize, since by training the last layer we reach a global minimum (zero training error) from almost every random initialization [11, 16, 23].

Qualitatively similar training dynamics are observed also in more standard (narrower) MNNs. Specifically, the training error usually descends on a single smooth slope path with no "barriers" [9], and the training error at local minima seems to be similar to the error at the global minimum [7].
The latter was explained in [7] by an analogy with high-dimensional random Gaussian functions, in which any critical point high above the global minimum has a low probability of being a local minimum. A different explanation of the same phenomenon was suggested by [6]. There, an MNN was mapped to a spin-glass Ising model, in which all local minima are confined to a finite band above the global minimum. However, it is not yet clear how relevant these statistical mechanics results are for actual MNNs and realistic datasets. First, the analogy in [7] is qualitative, and the mapping in [6] requires several implausible assumptions (e.g., independence of inputs and targets). Second, such statistical mechanics results become exact in the limit of infinite parameters, so for a finite number of layers, each layer should be infinitely wide. However, extremely wide networks may have serious over-fitting issues, as we explained before.

Previous works have shown that, given several limiting assumptions on the dataset, it is possible to get a low training error on an MNN with one hidden layer: [10] proved convergence for linearly separable datasets; [27] required either that $d_0 > N$, or clustering of the classes. Going beyond the training error, [2] showed that MNNs with one hidden layer can learn low-order polynomials, under a product-of-Gaussians distributional assumption on the input. Also, [17] devised a tensor method, instead of the standard SGD method, for which MNNs with one hidden layer are guaranteed to approximate arbitrary functions. Note, however, that the last two works require a rather large $N$ to get good guarantees.

3 Preliminaries

Model. We examine a Multilayer Neural Network (MNN) optimized on a finite training set $\{x^{(n)}, y^{(n)}\}_{n=1}^{N}$, where $X \triangleq [x^{(1)}, \dots, x^{(N)}] \in \mathbb{R}^{d_0 \times N}$ are the input patterns, $[y^{(1)}, \dots, y^{(N)}] \in \mathbb{R}^{1 \times N}$ are the target outputs (for simplicity we assume a scalar output), and $N$ is the number of samples. The MNN has $L$ layers, in which the layer inputs $u_l^{(n)} \in \mathbb{R}^{d_l}$ and outputs $v_l^{(n)} \in \mathbb{R}^{d_l}$ (a component of $v_l$ is denoted $v_{i,l}$) are given by

$$\forall n, \forall l \ge 1: \quad u_l^{(n)} \triangleq W_l v_{l-1}^{(n)}; \qquad v_l^{(n)} \triangleq \mathrm{diag}\left(a_l^{(n)}\right) u_l^{(n)}, \tag{3.1}$$

where $v_0^{(n)} = x^{(n)}$ is the input of the network, $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ are the weight matrices (a component of $W_l$ is denoted $W_{ij,l}$; bias terms are ignored for simplicity), and $a_l^{(n)} = a_l^{(n)}(u_l^{(n)})$ are piecewise constant activation slopes defined below. We set $A_l \triangleq [a_l^{(1)}, \dots, a_l^{(N)}]$.

Activations. Many commonly used piecewise-linear activation functions (e.g., rectified linear unit, maxout, max-pooling) can be written in the matrix product form in eq. (3.1). We consider the following relationship:

$$\forall n: \ a_L^{(n)} = 1, \qquad \forall l \le L-1: \ a_{i,l}^{(n)}\left(u_l^{(n)}\right) \triangleq \epsilon_{i,l}^{(n)} \cdot \begin{cases} 1, & \text{if } u_{i,l}^{(n)} \ge 0 \\ s, & \text{if } u_{i,l}^{(n)} < 0. \end{cases}$$

When $E_l \triangleq [\epsilon_l^{(1)}, \dots, \epsilon_l^{(N)}] = \mathbf{1}$ we recover the common leaky rectified linear unit (leaky ReLU) nonlinearity, with some fixed slope $s \ne 0$. The matrix $E_l$ can be viewed as a realization of dropout noise: in most implementations $\epsilon_{i,l}^{(n)}$ is distributed on a discrete set (e.g., $\{0, 1\}$), but competitive performance is obtained with continuous distributions (e.g., Gaussian) [31, 32]. Our results apply directly to the latter case. The inclusion of $E_l$ is the innovative part of our model: by performing smoothed analysis jointly on $X$ and $(E_1, \dots, E_{L-1})$ we are able to derive strong training error guarantees. However, our use of dropout is purely a proof strategy; we never expect dropout to reduce the training error in realistic datasets. This is further discussed in sections 6 and 7.
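The model in eq. (3.1) can be sketched in a few lines of NumPy. The helper below is illustrative only; the slope $s$, the dimensions, and the Gaussian choice for the noise $\epsilon$ are our assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.25                          # leaky-ReLU slope (assumed value)

def forward(x, weights, eps):
    """Eq. (3.1): u_l = W_l v_{l-1},  v_l = diag(a_l) u_l, where the
    hidden-layer slopes a_l are the leaky-ReLU slopes times the
    dropout-like noise eps_l, and a_L = 1 for the linear output layer."""
    v, slopes = x, []
    for l, W in enumerate(weights):
        u = W @ v
        if l < len(weights) - 1:                 # hidden layer
            a = eps[l] * np.where(u >= 0, 1.0, s)
        else:                                    # output layer: a_L = 1
            a = np.ones_like(u)
        v = a * u                                # diag(a_l) u_l
        slopes.append(a)
    return v, slopes

d0, d1 = 4, 3
W1 = rng.standard_normal((d1, d0))
W2 = rng.standard_normal((1, d1))
x = rng.standard_normal(d0)
eps = [rng.standard_normal(d1)]                  # continuous (Gaussian) E_1
v_out, slopes = forward(x, [W1, W2], eps)
```

With `eps` fixed to all-ones, `forward` reduces to a standard leaky-ReLU network.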
Measure-theoretic terminology. Throughout the paper, we make extensive use of the term $(C_1, \dots, C_k)$-almost everywhere, or a.e. for short. This is taken to mean almost everywhere with respect to the Lebesgue measure on all of the entries of $C_1, \dots, C_k$. A property holds a.e. with respect to some measure if the set of objects for which it does not hold has measure 0. In particular, our results hold with probability 1 whenever $(E_1, \dots, E_{L-1})$ is taken to have i.i.d. Gaussian entries, and arbitrarily small i.i.d. Gaussian noise is used to smooth the input $X$.

Loss function. We denote the output error $e \triangleq v_L - y$, where $v_L$ is the output of the neural network with $v_0 = x$, set $e \triangleq [e^{(1)}, \dots, e^{(N)}]$, and write $\hat{\mathbb{E}}$ for the empirical expectation over the training samples. We use the mean square error, which can be written in the following equivalent forms:

$$\mathrm{MSE} \triangleq \frac{1}{2}\hat{\mathbb{E}}\, e^2 = \frac{1}{2N}\sum_{n=1}^{N}\left(e^{(n)}\right)^2 = \frac{1}{2N}\|e\|^2. \tag{3.2}$$

The loss function depends on $X$, $(E_1, \dots, E_{L-1})$, and on the entire weight vector $w \triangleq [w_1^\top, \dots, w_L^\top]^\top \in \mathbb{R}^{\omega}$, where $w_l \triangleq \mathrm{vec}(W_l)$ is the flattened weight matrix of layer $l$, and $\omega = \sum_{l=1}^{L} d_{l-1} d_l$ is the total number of weights.

4 Single hidden layer

MNNs are typically trained by minimizing the loss over the training set using Stochastic Gradient Descent (SGD), or one of its variants (e.g., ADAM [18]). In this section and the next, we guarantee zero training loss in the common case of an over-parametrized MNN. We do this by analyzing the properties of differentiable local minima (DLMs) of the MSE (eq. (3.2)). We focus on DLMs since, under rather mild conditions [5, 25], SGD asymptotically converges to DLMs of the loss (for finite $N$, a point can be non-differentiable only if $\exists i, l, n$ such that $u_{i,l}^{(n)} = 0$). We first consider an MNN with one hidden layer ($L = 2$).
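The equivalent forms of the MSE in eq. (3.2) can be confirmed numerically; the values below are a toy sanity check, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
e = rng.standard_normal(N)                   # output errors e^(n)

mse_mean = 0.5 * np.mean(e ** 2)             # (1/2) E-hat[e^2]
mse_sum = np.sum(e ** 2) / (2 * N)           # (1/2N) sum_n (e^(n))^2
mse_norm = np.linalg.norm(e) ** 2 / (2 * N)  # (1/2N) ||e||^2
```

All three expressions agree up to floating-point rounding.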
We start by examining the MSE at a DLM:

$$\frac{1}{2}\hat{\mathbb{E}}\, e^2 = \frac{1}{2}\hat{\mathbb{E}}\left(y - W_2\, \mathrm{diag}(a_1) W_1 x\right)^2 = \frac{1}{2}\hat{\mathbb{E}}\left(y - a_1^\top \mathrm{diag}(w_2) W_1 x\right)^2. \tag{4.1}$$

To simplify notation, we absorb the redundant parameterization of the weights of the second layer into the first, $\tilde{W}_1 = \mathrm{diag}(w_2) W_1$, obtaining

$$\frac{1}{2}\hat{\mathbb{E}}\, e^2 = \frac{1}{2}\hat{\mathbb{E}}\left(y - a_1^\top \tilde{W}_1 x\right)^2. \tag{4.2}$$

Note that this is only a simplified notation; we do not actually change the weights of the MNN, so in both equations the activation slopes remain the same, i.e., $a_1^{(n)} = a_1(W_1 x^{(n)}) \ne a_1(\tilde{W}_1 x^{(n)})$. If there exists an infinitesimal perturbation which reduces the MSE in eq. (4.2), then there exists a corresponding infinitesimal perturbation which reduces the MSE in eq. (4.1). Therefore, if $(W_1, W_2)$ is a DLM of the MSE in eq. (4.1), then $\tilde{W}_1$ must also be a DLM of the MSE in eq. (4.2). Clearly, both DLMs have the same MSE value. Therefore, we proceed by assuming that $\tilde{W}_1$ is a DLM of eq. (4.2), and any constraint we derive for the MSE in eq. (4.2) will automatically apply to any DLM of the MSE in eq. (4.1).

If we are at a DLM of eq. (4.2), then its derivative is equal to zero. To calculate this derivative we rely on two facts. First, we can always switch the order of differentiation and expectation, since we average over a finite training set. Second, at any differentiable point (and in particular, a DLM), the derivative of $a_1$ with respect to the weights is zero. Thus, we find that, at any DLM,

$$\nabla_{\tilde{W}_1} \mathrm{MSE} = \hat{\mathbb{E}}\left[e\, a_1 x^\top\right] = 0. \tag{4.3}$$

To reshape this gradient equation into a more convenient form, we denote the Kronecker product by $\otimes$, and define the "gradient matrix" (without the error $e$)

$$G_1 \triangleq A_1 \circ X \triangleq \left[a_1^{(1)} \otimes x^{(1)}, \dots, a_1^{(N)} \otimes x^{(N)}\right] \in \mathbb{R}^{d_0 d_1 \times N}, \tag{4.4}$$

where $\circ$ denotes the Khatri-Rao product (cf. [1], [4]). Using this notation, and recalling that $e = [e^{(1)}, \dots, e^{(N)}]$, eq. (4.3) becomes

$$G_1 e = 0. \tag{4.5}$$

Therefore, $e$ lies in the right nullspace of $G_1$, which has dimension $N - \mathrm{rank}(G_1)$. Specifically, if $\mathrm{rank}(G_1) = N$, the only solution is $e = 0$. This immediately implies the following lemma.

Lemma 1. Suppose we are at some DLM of eq. (4.2). If $\mathrm{rank}(G_1) = N$, then $\mathrm{MSE} = 0$.

To show that $G_1$ generically has full column rank, we state the following important result, which is a special case of [1, Lemma 13].

Fact 2. For $B \in \mathbb{R}^{d_B \times N}$ and $C \in \mathbb{R}^{d_C \times N}$ with $N \le d_B d_C$, we have, $(B, C)$-almost everywhere,

$$\mathrm{rank}(B \circ C) = N. \tag{4.6}$$

However, since $A_1$ depends on $X$, we cannot apply eq. (4.6) directly to $G_1 = A_1 \circ X$. Instead, we apply eq. (4.6) for all (finitely many) possible values of $\mathrm{sign}(W_1 X)$ (appendix A), and obtain

Lemma 3. For $L = 2$, if $N \le d_1 d_0$, then simultaneously for every $w$, $\mathrm{rank}(G_1) = \mathrm{rank}(A_1 \circ X) = N$, $(X, E_1)$-almost everywhere.

Combining Lemma 1 with Lemma 3, we immediately have

Theorem 4. If $N \le d_1 d_0$, then all differentiable local minima of eq. (4.1) are global minima with $\mathrm{MSE} = 0$, $(X, E_1)$-almost everywhere.

Note that this result is tight, in the sense that the minimal hidden layer width $d_1 = \lceil N/d_0 \rceil$ is exactly the minimal width which ensures an MNN can implement any dichotomy [3] for inputs in general position.

5 Multiple hidden layers

We examine the implications of our approach for MNNs with more than one hidden layer. To find the DLMs of a general MNN, we again need to differentiate the MSE and equate it to zero. As in section 4, we exchange the order of expectation and differentiation, and use the fact that $a_1, \dots, a_{L-1}$ are piecewise constant.
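Before moving to the deeper case, note that Fact 2 and Lemma 3 are easy to probe numerically: for generic $(X, E_1)$ the Khatri-Rao product $G_1 = A_1 \circ X$ has full column rank whenever $N \le d_0 d_1$. A minimal sketch (the sizes and the slope $s$ are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d0, d1, N, s = 4, 3, 10, 0.25          # N = 10 <= d0 * d1 = 12

X = rng.standard_normal((d0, N))       # generic inputs
E1 = rng.standard_normal((d1, N))      # Gaussian dropout-like noise
W1 = rng.standard_normal((d1, d0))     # arbitrary first-layer weights

U1 = W1 @ X
A1 = E1 * np.where(U1 >= 0, 1.0, s)    # activation slopes, d1 x N

# Khatri-Rao product: column n is a_1^(n) kron x^(n), so G1 is d0*d1 x N
G1 = np.stack([np.kron(A1[:, n], X[:, n]) for n in range(N)], axis=1)

rank = np.linalg.matrix_rank(G1)       # = N for almost every (X, E1)
```

Rank deficiency occurs only on a measure-zero set of inputs and noise realizations, matching Lemma 3.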
Differentiating near a DLM with respect to $w_l$, the vectorized version of $W_l$, we obtain

$$\frac{1}{2}\nabla_{w_l} \hat{\mathbb{E}}\, e^2 = \hat{\mathbb{E}}\left[e \nabla_{w_l} e\right] = 0. \tag{5.1}$$

To calculate $\nabla_{w_l} e$ for the $l$-th weight layer, we write its input $v_l$ and its back-propagated "delta" signal (without the error $e$):

$$v_l \triangleq \left(\prod_{m=1}^{l} \mathrm{diag}(a_m) W_m\right) x; \qquad \delta_l \triangleq \mathrm{diag}(a_l) \prod_{m=L}^{l+1} W_m^\top \mathrm{diag}(a_m), \tag{5.2}$$

where we keep in mind that the $a_l$ are generally functions of the inputs and the weights. Using this notation we find

$$\nabla_{w_l} e = \nabla_{w_l} \left(\prod_{m=1}^{L} \mathrm{diag}(a_m) W_m\right) x = \delta_l^\top \otimes v_{l-1}^\top. \tag{5.3}$$

Thus, defining $\Delta_l = [\delta_l^{(1)}, \dots, \delta_l^{(N)}]$ and $V_l = [v_l^{(1)}, \dots, v_l^{(N)}]$, we can re-formulate eq. (5.1) as $G_l e = 0$, with

$$G_l \triangleq \Delta_l \circ V_{l-1} = \left[\delta_l^{(1)} \otimes v_{l-1}^{(1)}, \dots, \delta_l^{(N)} \otimes v_{l-1}^{(N)}\right] \in \mathbb{R}^{d_{l-1} d_l \times N}, \tag{5.4}$$

similarly to eq. (4.5) in the previous section. Therefore, each weight layer provides as many linear constraints (rows) as the number of its parameters. We can also combine all the constraints and get $Ge = 0$, with

$$G \triangleq \left[G_1^\top, \dots, G_L^\top\right]^\top \in \mathbb{R}^{\omega \times N}, \tag{5.5}$$

in which we have $\omega$ constraints (rows) corresponding to all the parameters in the MNN. As in the previous section, if $\omega \ge N$ and $\mathrm{rank}(G) = N$, we must have $e = 0$. However, it is generally difficult to find the rank of $G$, since we need to determine whether different $G_l$ have linearly dependent rows. Therefore, we focus on the last hidden layer and on the condition $\mathrm{rank}(G_{L-1}) = N$, which ensures $e = 0$, from eq. (5.4). However, since $v_{L-2}$ depends on the weights, we cannot use our results from the previous section, and it is possible that $\mathrm{rank}(G_{L-1}) < N$. For example, when $w = 0$ and $L \ge 3$, we get $G = 0$, so we are at a differentiable critical point (note that $G$ is well defined, even though $\forall l, n: u_l^{(n)} = 0$), which is generally not a global minimum.
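The constructions in eqs. (5.2) to (5.4) can be checked on a tiny three-layer network: stacking the columns $\delta_2^{(n)} \otimes v_1^{(n)}$ into $G_2$ and multiplying by the errors reproduces the ordinary back-propagation gradient of the MSE with respect to $W_2$. The sketch below uses assumed sizes and leaky-ReLU slope, with the dropout noise fixed to 1:

```python
import numpy as np

rng = np.random.default_rng(3)
d0, d1, d2, N, s = 3, 4, 5, 6, 0.25     # L = 3, N = 6 <= d1 * d2 = 20

X = rng.standard_normal((d0, N))
y = rng.standard_normal(N)
W1 = rng.standard_normal((d1, d0))
W2 = rng.standard_normal((d2, d1))
W3 = rng.standard_normal((1, d2))

# forward pass (eq. 3.1), leaky-ReLU slopes with E_l = 1
A1 = np.where(W1 @ X >= 0, 1.0, s); V1 = A1 * (W1 @ X)
A2 = np.where(W2 @ V1 >= 0, 1.0, s); V2 = A2 * (W2 @ V1)
e = (W3 @ V2).ravel() - y               # output errors (a_3 = 1)

# delta signal of layer 2 (eq. 5.2): delta_2 = diag(a_2) W_3^T
Delta2 = A2 * W3.ravel()[:, None]       # d2 x N

# G_2 = Delta_2 (Khatri-Rao) V_1 (eq. 5.4), shape d1*d2 x N
G2 = np.stack([np.kron(Delta2[:, n], V1[:, n]) for n in range(N)], axis=1)

# gradient of the MSE w.r.t. W_2 via standard backpropagation
grad_bp = (e * Delta2) @ V1.T / N       # d2 x d1

# the same gradient from G_2 e / N, reshaped per the vec convention of eq. 5.3
grad_kr = (G2 @ e / N).reshape(d2, d1)
```

The two gradients agree, which is exactly the content of eq. (5.1) rewritten as $G_2 e = 0$ at a DLM.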
Intuitively, such cases seem fragile, since if we give $w$ any random perturbation, one would expect that "typically" we would have $\mathrm{rank}(G_{L-1}) = N$. We establish this idea by first proving the following stronger result (appendix B).

Theorem 5. For $N \le d_{L-2} d_{L-1}$ and fixed values of $W_1, \dots, W_{L-2}$, any differentiable local minimum of the MSE (eq. 3.2) as a function of $W_{L-1}$ and $W_L$ is also a global minimum, with $\mathrm{MSE} = 0$, $(X, E_1, \dots, E_{L-1}, W_1, \dots, W_{L-2})$-almost everywhere.

Theorem 5 means that for any (Lebesgue measurable) random set of weights of the first $L-2$ layers, every DLM with respect to the weights of the last two layers is also a global minimum with loss 0. Note that the condition $N \le d_{L-2} d_{L-1}$ implies that $W_{L-1}$ has more weights than $N$ (a plausible scenario, e.g., [19]). In contrast, if instead we were only allowed to adjust the last layer of a random MNN, then low training error could only be ensured with extremely wide layers ($d_{L-1} \ge N$, as discussed in section 2), which require many more parameters ($d_{L-2} N$).

Theorem 5 can be easily extended to other types of neural networks, beyond the basic formalism introduced in section 3. For example, we can replace the layers below $L-2$ with convolutional layers, or other types of architectures. Additionally, the proof of Theorem 5 holds (with a trivial adjustment) when $E_1, \dots, E_{L-3}$ are fixed to have identical nonzero entries, that is, with dropout turned off except in the last two hidden layers. The result continues to hold even when $E_{L-2}$ is fixed as well, but then the condition $N \le d_{L-2} d_{L-1}$ has to be weakened to $N \le d_{L-1} \min_{l \le L-2} d_l$.

Next, we formalize our intuition above that DLMs of deep MNNs must have zero loss or be fragile, in the sense of the following immediate corollary of Theorem 5.

Corollary 6. For $N \le d_{L-2} d_{L-1}$, let $w$ be a differentiable local minimum of the MSE (eq. 3.2). Consider a new weight vector $\tilde{w} = w + \delta w$, where $\delta w$ has i.i.d. Gaussian (or uniform) entries with arbitrarily small variance. Then, $(X, E_1, \dots, E_{L-1})$-almost everywhere and with probability 1 w.r.t. $\delta w$, if $\tilde{W}_1, \dots, \tilde{W}_{L-2}$ are held fixed, all differentiable local minima of the MSE as a function of $W_{L-1}$ and $W_L$ are also global minima, with $\mathrm{MSE} = 0$.

(Recall that for matrix products we use the convention $\prod_{k=1}^{K} M_k = M_K M_{K-1} \cdots M_2 M_1$.)

Note that this result is different from the classical notion of linear stability at differentiable critical points, which is based on the analysis of the eigenvalues of the Hessian $H$ of the MSE. The Hessian can be written as a symmetric block matrix, where each of its blocks $H_{ml} \in \mathbb{R}^{d_{m-1} d_m \times d_{l-1} d_l}$ corresponds to layers $m$ and $l$. Specifically, using eq. (5.3), each block can be written as a sum of two components:

$$H_{ml} \triangleq \frac{1}{2}\nabla_{w_l} \nabla_{w_m^\top} \hat{\mathbb{E}}\, e^2 = \hat{\mathbb{E}}\left[e \nabla_{w_l} \nabla_{w_m^\top} e\right] + \hat{\mathbb{E}}\left[\nabla_{w_l} e\, \nabla_{w_m^\top} e\right] \triangleq \hat{\mathbb{E}}\left[e \Lambda_{ml}\right] + \frac{1}{N} G_m G_l^\top, \tag{5.6}$$

where, for $l < m$,

$$\Lambda_{ml} \triangleq \nabla_{w_l} \nabla_{w_m^\top} e = \nabla_{w_l}\left(\delta_m \otimes v_{m-1}\right) = \delta_m \otimes \left(\prod_{l'=l+1}^{m-1} \mathrm{diag}(a_{l'}) W_{l'}\right) \mathrm{diag}(a_l) \otimes v_{l-1}^\top, \tag{5.7}$$

while $\Lambda_{ll} = 0$, and $\Lambda_{ml} = \Lambda_{lm}^\top$ for $m < l$. Combining all the blocks, we get $H = \hat{\mathbb{E}}[e\Lambda] + \frac{1}{N} G G^\top \in \mathbb{R}^{\omega \times \omega}$. If we are at a DLM, then $H$ is positive semi-definite. If we examine again the differentiable critical point $w = 0$ with $L \ge 3$, we see that $H = 0$, so it is not a strict saddle. However, this point is fragile in the sense of Corollary 6.

Interestingly, the positive semi-definite nature of the Hessian at DLMs imposes additional constraints on the error. Note that the matrix $GG^\top$ is symmetric positive semi-definite of relatively small rank ($\le N$).
However, $\hat{\mathbb{E}}[e\Lambda]$ can potentially be of high rank, and thus may have many negative eigenvalues (the trace of $\hat{\mathbb{E}}[e\Lambda]$ is zero, so the sum of all its eigenvalues is also zero). Therefore, intuitively, we expect that for $H$ to be positive semi-definite, $e$ has to become small, generically (i.e., except at some pathological points such as $w = 0$). This is indeed observed empirically [7, Fig. 1].

6 Numerical experiments

In this section we examine numerically our main results in this paper, Theorems 4 and 5, which hold almost everywhere with respect to the Lebesgue measure over the data and dropout realization. However, without dropout, this analysis is not guaranteed to hold. For example, our results do not hold in MNNs where all the weights are negative, so $A_{L-1}$ has constant entries and therefore $G_{L-1}$ cannot have full rank. Nonetheless, if the activations are sufficiently "variable" (formally, $G_{L-1}$ has full rank), then we expect our results to hold even without dropout noise and with the leaky ReLUs replaced by basic ReLUs ($s = 0$). We tested this numerically and present the results in Figure 6.1. We performed a binary classification task on a synthetic random dataset and on subsets of the MNIST dataset, and show the mean classification error (MCE, which is the fraction of samples incorrectly classified), commonly used in these tasks. Note that the MNIST dataset, which contains some redundant information between training samples, is much easier (a lower error) than the completely random synthetic data. Thus the performance on the random data is more representative of the "typical worst case" (i.e., hard yet non-pathological input), which our smoothed analysis approach aims to uncover. For one hidden layer, the error goes to zero when the number of non-redundant parameters is greater than the number of samples ($d^2/N \ge 1$), as predicted by Theorem 4.
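A scaled-down sketch of the one-hidden-layer experiment (plain batch gradient descent instead of Adam; the sizes, learning rate, and step count are our assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(4)
d0 = d = 8                               # input width = hidden width
N = 16                                   # over-parametrized: d0 * d = 64 >= N

X = rng.standard_normal((d0, N))         # synthetic inputs ~ N(0, 1)
y = rng.choice([-1.0, 1.0], size=N)      # random +-1 targets
W1 = rng.standard_normal((d, d0)) * np.sqrt(2.0 / d0)  # variance-2/d style init
W2 = rng.standard_normal((1, d)) * np.sqrt(2.0 / d)

lr, mse_hist = 0.02, []
for _ in range(2000):                    # batch gradient descent on the MSE
    U1 = W1 @ X
    V1 = np.maximum(U1, 0.0)             # (non-leaky) ReLU, as in the experiments
    e = (W2 @ V1).ravel() - y            # output errors
    mse_hist.append(0.5 * np.mean(e ** 2))
    gW2 = (e[None, :] @ V1.T) / N        # dMSE/dW2
    gW1 = ((W2.T @ e[None, :]) * (U1 > 0)) @ X.T / N   # dMSE/dW1
    W2 -= lr * gW2
    W1 -= lr * gW1
```

With these (assumed) settings the training MSE typically decays toward zero, the qualitative behavior of Figure 6.1; the paper's actual experiments used Adam, up to 4000 epochs, and tuned learning rates.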
Theorem 5 predicts a similar behavior when $d^2/N \ge 1$ for an MNN with two hidden layers (note that we trained all the layers of the MNN). This prediction also seems to hold, but less tightly. This is reasonable, as our analysis in section 5 suggests that typically the error would be zero if the total number of parameters is larger than the number of training samples ($d^2/N \ge 0.5$), though this was not proven. We note that in all the repetitions in Figure 6.1, for $d^2 \ge N$, the matrix $G_{L-1}$ always had full rank. However, for smaller MNNs than shown in Figure 6.1 (about $d \le 20$), sometimes $G_{L-1}$ did not have full rank.

Recall that Theorems 4 and 5 both give guarantees only on the training error at a DLM. However, for finite $N$, since the loss is non-differentiable at some points, it is not clear that such DLMs actually exist, or that we can converge to them. To check whether this is indeed the case, we performed the following experiment. We trained the MNN for many epochs, using batch gradient steps. Then we started to gradually decrease the learning rate.

Figure 6.1: Final training error (mean ± std) in the over-parametrized regime is low, as predicted by our results (right of the dashed black line). We trained standard MNNs with one or two hidden layers (with widths equal to $d = d_0$), a single output, (non-leaky) ReLU activations, MSE loss, and no dropout, on two datasets: (1) a synthetic random dataset in which, for each $n = 1, \dots, N$, $x^{(n)}$ was drawn from a normal distribution $\mathcal{N}(0, 1)$ and $y^{(n)} = \pm 1$ with probability 0.5; (2) binary classification (between digits 0-4 and 5-9) on $N$-sized subsets of the MNIST dataset [21]. The value at each data point is an average of the mean classification error (MCE) over 30 repetitions. In this figure, when the mean MCE reached zero, it was zero for all 30 repetitions.
If we are at a DLM, then all the activation inputs $u_{i,l}^{(n)}$ should converge to distinctly non-zero values, as demonstrated in Figure 6.2. In this figure, we tested a small MNN on synthetic data, and all the neural inputs seem to remain constant at a non-zero value, while the MSE keeps decreasing. This was the typical case in our experiments. However, in some instances we would see some $u_{i,l}^{(n)}$ converge to a very low value ($10^{-16}$). This may indicate that convergence to non-differentiable points is possible as well.

Implementation details. Weights were initialized to be uniform with mean zero and variance $2/d$, as suggested in [14]. In each epoch we randomly permuted the dataset and used the Adam [18] optimization method (a variant of SGD) with $\beta_1 = 0.9$, $\beta_2 = 0.99$, $\varepsilon = 10^{-8}$. In Figure 6.1 the training was done for no more than 4000 epochs (we stopped if MCE = 0 was reached). Different learning rates and mini-batch sizes were selected for each dataset and architecture.

7 Discussion

In this work we provided training error guarantees for mildly over-parameterized MNNs at all differentiable local minima (DLMs). For a single hidden layer (section 4), the proof is surprisingly simple. We show that the MSE near each DLM is locally similar to that of linear regression (i.e., a single linear neuron). This allows us to prove (Theorem 4) that, almost everywhere, if the number of non-redundant parameters ($d_0 d_1$) is larger than the number of samples $N$, then all DLMs are global minima with $\mathrm{MSE} = 0$, as in linear regression. With more than one hidden layer, Theorem 5 states that if $N \le d_{L-2} d_{L-1}$ (i.e., $W_{L-1}$ has more weights than $N$), then we can always perturb and fix some weights in the MNN so that all the DLMs are again global minima with $\mathrm{MSE} = 0$.
Note that in a realistic setting, zero training error should not necessarily be the intended objective of training, since it may encourage overfitting. Our main goal here was to show that essentially all DLMs provide good training error (which is not trivial in a non-convex model). However, one can decrease the size of the model or artificially increase the number of samples (e.g., using data augmentation, or re-sampling the dropout noise) to be in a mildly under-parameterized regime, and have relatively small error, as seen in Figure 6.1. For example, in AlexNet [19] $W_{L-1}$ has $4096^2 \approx 17 \cdot 10^6$ weights, which is larger than $N = 1.2 \cdot 10^6$, as required by Theorem 5. However, without data augmentation or dropout, AlexNet did exhibit severe overfitting.

Figure 6.2: The existence of differentiable local minima. In this representative figure, we trained an MNN with a single hidden layer, as in Figure 6.1, with $d = 25$, on the synthetic random data ($N = 100$) until convergence with gradient descent (so each epoch is a gradient step). Then, starting from epoch 5000 (dashed line), we gradually decreased the learning rate (multiplying it by 0.999 each epoch) until it was about $10^{-9}$. We see that the activation inputs converged to values above $10^{-5}$, while the final MSE was about $10^{-31}$. The magnitudes of these numbers, and the fact that the neuronal inputs do not keep decreasing with the learning rate, indicate that we converged to a differentiable local minimum with MSE equal to 0, as predicted.

Our analysis is non-asymptotic, relying on the fact that, near differentiable points, MNNs with piecewise linear activation functions can be differentiated similarly to linear MNNs [28]. We use a smoothed analysis approach, in which we examine the error of the MNN under slight random perturbations of worst-case input and dropout.
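The smoothed-analysis viewpoint can be illustrated directly: a pathological dataset makes $G_1$ rank-deficient, but an arbitrarily small random perturbation of the input and the noise restores full column rank almost surely. A toy sketch (all sizes and the perturbation scale are assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
d0, d1, N, s = 2, 3, 4, 0.25            # N = 4 <= d0 * d1 = 6

def khatri_rao_G1(X, E1, W1):
    """G_1 = A_1 (Khatri-Rao) X, with slopes a_1 = E_1 * leaky-ReLU'(W_1 X)."""
    A1 = E1 * np.where(W1 @ X >= 0, 1.0, s)
    return np.stack([np.kron(A1[:, n], X[:, n]) for n in range(X.shape[1])],
                    axis=1)

W1 = rng.standard_normal((d1, d0))

# pathological dataset: all input patterns identical -> all columns of G_1 equal
X_bad = np.ones((d0, N))
E_off = np.ones((d1, N))                 # dropout noise switched off
rank_bad = np.linalg.matrix_rank(khatri_rao_G1(X_bad, E_off, W1))

# smoothing: a small perturbation of (X, E) restores full column rank
X_smooth = X_bad + 1e-3 * rng.standard_normal((d0, N))
E_smooth = E_off + 1e-3 * rng.standard_normal((d1, N))
rank_smooth = np.linalg.matrix_rank(khatri_rao_G1(X_smooth, E_smooth, W1))
```

Here `rank_bad` collapses to 1 (every column of $G_1$ is identical), while after the perturbation the rank is $N$, which is the "almost every dataset and noise realization" mechanism behind Lemma 3 and Theorem 5.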
Our experiments (Figure 6.1) suggest that our results describe the typical performance of MNNs, even without dropout. Note that we do not claim that dropout has any merit in reducing the training loss on real datasets; as used in practice, dropout typically trades off training performance in favor of improved generalization. Thus, the role of dropout in our results is purely theoretical. In particular, dropout ensures that the gradient matrix $G_{L-1}$ (eq. (5.4)) has full column rank. It would be an interesting direction for future work to find other sufficient conditions for $G_{L-1}$ to have full column rank.

Many other directions remain for future work. For example, we believe it should be possible to extend this work to multi-output MNNs and/or other convex loss functions besides the quadratic loss. Our results might also be extended to stable non-differentiable critical points (which may exist, see section 6) using the necessary condition that the sub-gradient set contains zero at any critical point [26]. Another important direction is improving the results of Theorem 5, so that they make efficient use of all the parameters of the MNN, and not just the last two weight layers. Such results might be used as a guideline for architecture design when the training error is a major bottleneck [13]. Last, but not least, in this work we focused on the empirical risk (training error) at DLMs. Such guarantees might be combined with generalization guarantees (e.g., [12]) to obtain novel excess risk bounds that go beyond uniform convergence analysis.

Acknowledgments

The authors are grateful to O. Barak, D. Carmon, Y. Han, Y. Harel, R. Meir, E. Meirom, L. Paninski, R. Rubin, M. Stern, U. Sümbül and A. Wolf for helpful discussions.
The research was partially supported by the Gruss Lipper Charitable Foundation, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

[1] Elizabeth S. Allman, Catherine Matias, and John A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37(6A):3099–3132, 2009.
[2] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In ICML, 2014.
[3] Eric B. Baum. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3):193–215, 1988.
[4] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. ArXiv:1311.3651, 2013.
[5] L. Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, pages 1–34, 1998.
[6] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. AISTATS, 38, 2015.
[7] Y. N. Dauphin, Razvan Pascanu, and Caglar Gulcehre. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, pages 1–9, 2014.
[8] K. Fukumizu and S. Amari. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13:317–327, 2000.
[9] Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. ICLR, 2015.
[10] Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation, 1992.
[11] Benjamin D. Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. ArXiv:1506.07540, 2015.
[12] Moritz Hardt, Benjamin Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. ArXiv:1509.01240, pages 1–24, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. ArXiv:1512.03385, 2015.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[15] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv:1207.0580, pages 1–18, 2012.
[16] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70(1-3):489–501, 2006.
[17] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. ArXiv:1506.08473, pages 1–25, 2015.
[18] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, pages 1–13, 2015.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Y. LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
[22] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent converges to minimizers. ArXiv:1602.04915, 2016.
[23] Roi Livni, S. Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. NIPS, 2014.
[24] Nils J. Nilsson. Learning Machines. McGraw-Hill, New York, 1965.
[25] R. Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, 18(2):698–712, 1990.
[26] R. T. Rockafellar. Directionally Lipschitzian functions and subdifferential calculus. Proceedings of the London Mathematical Society, 39(77):331–355, 1979.
[27] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. ArXiv:1511.04210, 2015.
[28] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2014.
[29] Jirí Síma. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.
[30] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis: An attempt to explain the behavior of algorithms in practice. Communications of the ACM, 52(10):76–84, 2009.
[31] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[32] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolution network. ICML Deep Learning Workshop, 2015.

Appendix

In this appendix we give the proofs for our main results in the paper.
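Before the formal arguments, the core rank claim behind these proofs can be probed numerically: for generic data $\mathbf{X}$ and continuous dropout-like activation slopes $\mathbf{A}_1$, the column-wise Khatri-Rao product $\mathbf{A}_1 \circ \mathbf{X}$ has full column rank $N$ whenever $N \le d_1 d_0$ (Lemma 3 below). The following is a minimal NumPy sketch with illustrative dimensions; the `einsum` rendering of the $\circ$ operator and the Gaussian draws are our implementation choices, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1 = 4, 5
N = d0 * d1  # the boundary case N = d_1 * d_0

X = rng.standard_normal((d0, N))   # generic data matrix
A1 = rng.standard_normal((d1, N))  # continuous dropout-like activation slopes

# Column-wise Khatri-Rao product: column n of G1 is kron(A1[:, n], X[:, n])
G1 = np.einsum('in,jn->ijn', A1, X).reshape(d1 * d0, N)

# Full column rank for almost every draw, as Lemma 3 predicts
assert np.linalg.matrix_rank(G1) == N
```

Degenerate draws (e.g., repeated data columns) do break the assertion, but they form a measure-zero set, which is exactly the "almost everywhere" qualifier in the statements below.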
But first, we define some additional notation. Recall that for every layer $l$, data instance $n$ and index $i$, the activation slope $a_{i,l}^{(n)}$ takes one of two values: $\epsilon_{i,l}^{(n)}$ or $s\epsilon_{i,l}^{(n)}$ (with $s \neq 0$). Hence, for a single realization of $\mathbf{E}_l = [\boldsymbol{\epsilon}_l^{(1)}, \ldots, \boldsymbol{\epsilon}_l^{(N)}]$, the matrix $\mathbf{A}_l = [\mathbf{a}_l^{(1)}, \ldots, \mathbf{a}_l^{(N)}] \in \mathbb{R}^{d_l \times N}$ can have up to $2^{N d_l}$ distinct values, and the tuple $(\mathbf{A}_1, \ldots, \mathbf{A}_{L-1})$ can have at most $P = 2^{N \sum_{l=1}^{L-1} d_l}$ distinct values. We will find it useful to enumerate these possibilities by an index $p \in \{1, \ldots, P\}$, which we call the activation pattern. We similarly denote by $\mathbf{A}_l^p$ the value of $\mathbf{A}_l$ under activation pattern $p$. Lastly, we will make use of the following fact:

Fact 7. If properties $P_1, P_2, \ldots, P_m$ each hold almost everywhere, then $\cap_{i=1}^{m} P_i$ also holds almost everywhere.

A  Single hidden layer — proof of Lemma 3

We prove the following Lemma 3, using the previous notation and the results from section 4.

Lemma. For $L = 2$, if $N \le d_1 d_0$, then simultaneously for every $\mathbf{w}$, $\operatorname{rank}(\mathbf{G}_1) = \operatorname{rank}(\mathbf{A}_1 \circ \mathbf{X}) = N$, $(\mathbf{X}, \mathbf{E}_1)$ almost everywhere.

Proof. We fix an activation pattern $p$ and set $\mathbf{G}_1^p = \mathbf{A}_1^p \circ \mathbf{X}$. We apply eq. (4.6) to conclude that $\operatorname{rank}(\mathbf{G}_1^p) = \operatorname{rank}(\mathbf{A}_1^p \circ \mathbf{X}) = N$, $(\mathbf{X}, \mathbf{A}_1^p)$-a.e., and hence also $(\mathbf{X}, \mathbf{E}_1)$-a.e. We repeat the argument for all $2^{N d_1}$ values of $p$ and use Fact 7 to conclude that $\operatorname{rank}(\mathbf{G}_1^p) = N$ for all $p$ simultaneously, $(\mathbf{X}, \mathbf{E}_1)$-a.e. Since for every set of weights we have $\mathbf{G}_1 = \mathbf{G}_1^p$ for some $p$, we have $\operatorname{rank}(\mathbf{G}_1) = N$, $(\mathbf{X}, \mathbf{E}_1)$-a.e.

B  Multiple hidden layers — proof of Theorem 5

First we prove the following helpful lemma, using a technique similar to that of [1].

Lemma 8. Let $\mathbf{M}(\theta) \in \mathbb{R}^{a \times b}$ be a matrix with $a \ge b$, whose entries are all polynomial functions of some vector $\theta$. Also, assume that for some value $\theta_0$ we have $\operatorname{rank}(\mathbf{M}(\theta_0)) = b$.
Then, for almost every $\theta$, we have $\operatorname{rank}(\mathbf{M}(\theta)) = b$.

Proof. There exists a polynomial mapping $g: \mathbb{R}^{a \times b} \to \mathbb{R}$ such that $\mathbf{M}(\theta)$ does not have full column rank if and only if $g(\mathbf{M}(\theta)) = 0$. Since $b \le a$, we can construct $g$ explicitly as the sum of the squares of the determinants of all the $b \times b$ submatrices obtained by selecting $b$ rows of $\mathbf{M}(\theta)$. Since $g(\mathbf{M}(\theta_0)) \neq 0$, the polynomial $g(\mathbf{M}(\theta))$ is not identically zero. Therefore the zero set of this (non-trivial) polynomial, on which $g(\mathbf{M}(\theta)) = 0$, has measure zero.

Next we prove Theorem 5, using the previous notation and the results from section 5:

Theorem. For $N \le d_{L-2} d_{L-1}$ and fixed values of $\mathbf{W}_1, \ldots, \mathbf{W}_{L-2}$, any differentiable local minimum of the MSE (eq. (3.2)), as a function of $\mathbf{W}_{L-1}$ and $\mathbf{W}_L$, is also a global minimum, with $\mathrm{MSE} = 0$, $(\mathbf{X}, \mathbf{E}_1, \ldots, \mathbf{E}_{L-1}, \mathbf{W}_1, \ldots, \mathbf{W}_{L-2})$ almost everywhere.

Proof. Without loss of generality, assume $\mathbf{w}_L = \mathbf{1}$, since we can absorb the weights of the last layer into weight layer $L-1$, as we did in the single hidden layer case (eq. (4.2)). Fix an activation pattern $p \in \{1, \ldots, P\}$ as defined at the beginning of this appendix. Set

$$\mathbf{v}_{L-2}^{p(n)} \triangleq \left( \prod_{m=1}^{L-2} \operatorname{diag}\!\left(\mathbf{a}_m^{p(n)}\right) \mathbf{W}_m \right) \mathbf{x}^{(n)}, \qquad \mathbf{V}_{L-2}^p \triangleq \left[ \mathbf{v}_{L-2}^{p(1)}, \ldots, \mathbf{v}_{L-2}^{p(N)} \right] \in \mathbb{R}^{d_{L-2} \times N} \tag{B.1}$$

and

$$\mathbf{G}_{L-1}^p \triangleq \mathbf{A}_{L-1}^p \circ \mathbf{V}_{L-2}^p. \tag{B.2}$$

Note that, since the activation pattern is fixed, the entries of $\mathbf{G}_{L-1}^p$ are polynomials in the entries of $(\mathbf{X}, \mathbf{E}_1, \ldots, \mathbf{E}_{L-1}, \mathbf{W}_1, \ldots, \mathbf{W}_{L-2})$, and we may therefore apply Lemma 8 to $\mathbf{G}_{L-1}^p$. Thus, to establish $\operatorname{rank}(\mathbf{G}_{L-1}^p) = N$, $(\mathbf{X}, \mathbf{E}_1, \ldots, \mathbf{E}_{L-1}, \mathbf{W}_1, \ldots, \mathbf{W}_{L-2})$-a.e., we only need to exhibit a single $(\mathbf{X}', \mathbf{E}_1', \ldots, \mathbf{E}_{L-1}', \mathbf{W}_1', \ldots, \mathbf{W}_{L-2}')$ for which $\operatorname{rank}(\mathbf{G}_{L-1}'^p) = N$. We note that for a fixed activation pattern we can obtain any value of $(\mathbf{A}_1^p, \ldots, \mathbf{A}_{L-1}^p)$ with some choice of $(\mathbf{E}_1', \ldots, \mathbf{E}_{L-1}')$, so we will specify $(\mathbf{A}_1'^p, \ldots, \mathbf{A}_{L-1}'^p)$ directly. We make the following choices:

$$x_i'^{(n)} = 1 \;\; \forall i, n, \qquad a_{i,l}'^{p(n)} = 1 \;\; \forall l < L-2, \; \forall i, n \tag{B.3}$$

$$\mathbf{W}_l' = \left[ \mathbf{1}_{d_l \times 1}, \mathbf{0}_{d_l \times (d_{l-1}-1)} \right] \;\; \forall l \le L-2 \tag{B.4}$$

$$\mathbf{A}_{L-2}'^p = \left[ \mathbf{1}_{1 \times d_{L-1}} \otimes \mathbf{I}_{d_{L-2}} \right]_{1,\ldots,N}, \qquad \mathbf{A}_{L-1}'^p = \left[ \mathbf{I}_{d_{L-1}} \otimes \mathbf{1}_{1 \times d_{L-2}} \right]_{1,\ldots,N} \tag{B.5}$$

where $\mathbf{1}_{a \times b}$ (respectively, $\mathbf{0}_{a \times b}$) denotes an all-ones (all-zeros) matrix of dimensions $a \times b$, $\mathbf{I}_a$ denotes the $a \times a$ identity matrix, and $[\mathbf{M}]_{1,\ldots,N}$ denotes the matrix composed of the first $N$ columns of $\mathbf{M}$. It is easy to verify that with this choice we have $\mathbf{W}_{L-2}' \left( \prod_{m=1}^{L-3} \operatorname{diag}\!\left(\mathbf{a}_m'^{p(n)}\right) \mathbf{W}_m' \right) \mathbf{x}'^{(n)} = \mathbf{1}_{d_{L-2} \times 1}$ for every $n$, and so $\mathbf{V}_{L-2}'^p = \mathbf{A}_{L-2}'^p$ and

$$\mathbf{G}_{L-1}'^p = \mathbf{A}_{L-1}'^p \circ \mathbf{A}_{L-2}'^p = \left[ \mathbf{I}_{d_{L-2} d_{L-1}} \right]_{1,\ldots,N} \tag{B.6}$$

which obviously satisfies $\operatorname{rank}(\mathbf{G}_{L-1}'^p) = N$. We conclude that $\operatorname{rank}(\mathbf{G}_{L-1}^p) = N$, $(\mathbf{X}, \mathbf{E}_1, \ldots, \mathbf{E}_{L-1}, \mathbf{W}_1, \ldots, \mathbf{W}_{L-2})$-a.e., and remark that this argument proves Fact 2 if we specialize to $L = 2$. As in the proof of Lemma 3, we apply the above argument for all values of $p$, and conclude via Fact 7 that $\operatorname{rank}(\mathbf{G}_{L-1}^p) = N$ for every $p$, $(\mathbf{X}, \mathbf{E}_1, \ldots, \mathbf{E}_{L-1}, \mathbf{W}_1, \ldots, \mathbf{W}_{L-2})$-a.e. Since for every $\mathbf{w}$ we have $\mathbf{G}_{L-1} = \mathbf{G}_{L-1}^p$ for some $p$ that depends on $\mathbf{w}$, this implies that, $(\mathbf{X}, \mathbf{E}_1, \ldots, \mathbf{E}_{L-1}, \mathbf{W}_1, \ldots, \mathbf{W}_{L-2})$-a.e., $\operatorname{rank}(\mathbf{G}_{L-1}) = N$ simultaneously for all values of $\mathbf{W}_{L-1}$. Thus, at any DLM of the MSE with all weights except $\mathbf{W}_{L-1}$ fixed, we can use eq. (5.4) ($\mathbf{G}_{L-1} \mathbf{e} = 0$) to conclude $\mathbf{e} = 0$.
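The explicit rank-$N$ witness built from the two Kronecker products in eq. (B.5) can be verified directly. The NumPy sketch below (layer widths are arbitrary illustrative values; the `einsum` rendering of the Khatri-Rao operator $\circ$ is our implementation choice) constructs $\mathbf{A}_{L-2}'^p$ and $\mathbf{A}_{L-1}'^p$ and confirms eq. (B.6):

```python
import numpy as np

# Hypothetical layer widths d_{L-2}, d_{L-1}; any N <= d_{L-2} * d_{L-1} works
d_Lm2, d_Lm1 = 3, 4
N = 10

# eq. (B.5): first N columns of the two Kronecker products
A_Lm2 = np.kron(np.ones((1, d_Lm1)), np.eye(d_Lm2))[:, :N]
A_Lm1 = np.kron(np.eye(d_Lm1), np.ones((1, d_Lm2)))[:, :N]

# Khatri-Rao product (column n is the Kronecker product of the n-th columns),
# playing the role of the ∘ operator in eq. (B.2)
G = np.einsum('in,jn->ijn', A_Lm1, A_Lm2).reshape(d_Lm1 * d_Lm2, N)

# eq. (B.6): G equals the first N columns of the identity, so rank(G) = N
assert np.allclose(G, np.eye(d_Lm1 * d_Lm2)[:, :N])
assert np.linalg.matrix_rank(G) == N
```

Column $n$ of $\mathbf{A}_{L-2}'^p$ cycles through the standard basis of $\mathbb{R}^{d_{L-2}}$ while column $n$ of $\mathbf{A}_{L-1}'^p$ steps through the basis of $\mathbb{R}^{d_{L-1}}$ once per cycle, so their columnwise Kronecker products enumerate the standard basis of $\mathbb{R}^{d_{L-2} d_{L-1}}$ in order, exactly as eq. (B.6) states.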
