Identity Matters in Deep Learning


Authors: Moritz Hardt, Tengyu Ma

July 23, 2018

Abstract

An emerging design principle in deep learning is that each layer of a deep artificial neural network should be able to easily express the identity transformation. This idea not only motivated various normalization techniques, such as batch normalization, but was also key to the immense success of residual networks. In this work, we put the principle of identity parameterization on a more solid theoretical footing alongside further empirical progress. We first give a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima. The same result for linear feed-forward networks in their standard parameterization is substantially more delicate. Second, we show that residual networks with ReLU activations have universal finite-sample expressivity, in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size. Directly inspired by our theory, we experiment with a radically simple residual architecture consisting of only residual convolutional layers and ReLU activations, but no batch normalization, dropout, or max pooling. Our model improves significantly on previous all-convolutional networks on the CIFAR10, CIFAR100, and ImageNet classification benchmarks.

1 Introduction

Traditional convolutional neural networks for image classification, such as AlexNet ([13]), are parameterized in such a way that when all trainable weights are 0, a convolutional layer represents the 0-mapping. Moreover, the weights are initialized symmetrically around 0. This standard parameterization makes it non-trivial for a convolutional layer trained with stochastic gradient methods to preserve features that were already good. Put differently, such convolutional layers cannot easily converge to the identity transformation at training time.
This shortcoming was observed and partially addressed by [9] through batch normalization, i.e., layer-wise whitening of the input with a learned mean and covariance. But the idea remained somewhat implicit until residual networks ([6]; [7]) explicitly introduced a reparameterization of the convolutional layers such that when all trainable weights are 0, the layer represents the identity function. Formally, for an input x, each residual layer has the form x + h(x), rather than h(x). This simple reparameterization allows for much deeper architectures, largely avoiding the problem of vanishing (or exploding) gradients. Residual networks, and subsequent architectures that use the same parameterization, have since then consistently achieved state-of-the-art results on various computer vision benchmarks such as CIFAR10 and ImageNet.

(Author affiliations: Moritz Hardt, Google Brain, m@mrtz.org; Tengyu Ma, Stanford University, tengyu@stanford.edu. Work performed at Google.)

1.1 Our contributions

In this work, we consider identity parameterizations from a theoretical perspective, while translating some of our theoretical insights back into experiments. Loosely speaking, our first result underlines how identity parameterizations make optimization easier, while our second result shows the same is true for representation.

Linear residual networks. Since general non-linear neural networks are beyond the reach of current theoretical methods in optimization, we consider the case of deep linear networks as a simplified model. A linear network represents an arbitrary linear map as a sequence of matrices A_ℓ ⋯ A_2 A_1. The objective function is E‖y − A_ℓ ⋯ A_1 x‖², where y = Rx for some unknown linear transformation R and x is drawn from a distribution. Such linear networks have been studied actively in recent years as a stepping stone toward the general non-linear case (see Section 1.2).
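The non-convexity of this factored objective is visible already in the simplest possible instance. The following toy check (our illustration, with d = 1, ℓ = 2, and target R = 1) exhibits two global minimizers whose midpoint has strictly larger objective value, which a convex function cannot do:

```python
# Scalar factored objective f(a1, a2) = (a1*a2 - r)^2 with target r = 1.
# (1, 1) and (-1, -1) are both global minima, yet their midpoint (0, 0)
# lies strictly above the chord connecting them, so f is not convex.
def f(a1, a2, r=1.0):
    return (a1 * a2 - r) ** 2

assert f(1, 1) == 0.0 and f(-1, -1) == 0.0
midpoint, chord = f(0, 0), 0.5 * (f(1, 1) + f(-1, -1))
assert midpoint > chord  # convexity would require midpoint <= chord
print(midpoint, chord)   # 1.0 0.0
```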
Even though A_ℓ ⋯ A_1 is just a linear map, the optimization problem over the factored variables (A_ℓ, …, A_1) is non-convex. In analogy with residual networks, we will instead parameterize the objective function as

    min_{A_1,…,A_ℓ} E‖y − (I + A_ℓ) ⋯ (I + A_1) x‖².    (1.1)

To give some intuition, when the depth ℓ is large enough, we can hope that the target function R has a factored representation in which each matrix A_i has small norm. Any symmetric positive semidefinite matrix O can, for example, be written as a product O = O_ℓ ⋯ O_1, where each O_i = O^{1/ℓ} is very close to the identity for large ℓ, so that A_i = O_i − I has small spectral norm. We first prove that an analogous claim is true for all linear transformations R with positive determinant.¹ Specifically, we prove that for every linear transformation R with det(R) > 0, there exists a global optimizer (A_1, …, A_ℓ) of (1.1) such that for large enough depth ℓ,

    max_{1 ≤ i ≤ ℓ} ‖A_i‖ ≤ O(1/ℓ).    (1.2)

Here, ‖A‖ denotes the spectral norm of A. The constant factor depends on the conditioning of R. We give the formal statement in Theorem 2.1. The theorem has the interesting consequence that as the depth increases, smaller-norm solutions exist, and hence regularization may offset the increase in parameters.

¹As will be discussed below Theorem 2.1, it is without loss of generality to assume that the determinant of R is positive.

Having established the existence of small-norm solutions, our main result on linear residual networks shows that the objective function (1.1) is, in fact, easy to optimize when all matrices have sufficiently small norm. More formally, letting A = (A_1, …, A_ℓ) and f(A) denote the objective function in (1.1), we can show that the gradients vanish only when f(A) = 0, provided that max_i ‖A_i‖ ≤ O(1/ℓ). See Theorem 2.2.
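The factorization intuition above can be checked numerically. The following sketch (ours, not the authors' code; the matrix and depth are arbitrary choices) builds a random symmetric positive definite O, forms its ℓ-th matrix root from an eigendecomposition, and confirms that each residual A_i = O^{1/ℓ} − I has small spectral norm while the product of the ℓ factors recovers O:

```python
import numpy as np

rng = np.random.default_rng(0)
d, ell = 5, 100

# Random symmetric positive definite matrix O.
M = rng.standard_normal((d, d))
O = M @ M.T + 0.1 * np.eye(d)

# ell-th matrix root via the eigendecomposition O = U diag(z) U^T.
z, U = np.linalg.eigh(O)
O_root = U @ np.diag(z ** (1.0 / ell)) @ U.T

# Each factor is I + A_i with A_i = O^(1/ell) - I small.
A_i = O_root - np.eye(d)
print(np.linalg.norm(A_i, 2))  # spectral norm, roughly max_i |log z_i| / ell

# Multiplying the ell identical factors recovers O.
assert np.allclose(np.linalg.matrix_power(O_root, ell), O)
```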
This result implies that linear residual networks have no critical points other than the global optimum. In contrast, for standard linear neural networks we only know, by work of [12], that these networks have no local optima other than the global optimum; this does not rule out other critical points. In fact, setting A_i = 0 always yields a bad critical point in the standard parameterization.

Universal finite-sample expressivity. Going back to non-linear residual networks with ReLU activations, we can ask: how expressive are deep neural networks that are solely based on residual layers with ReLU activations? To answer this question, we give a very simple construction showing that such residual networks have perfect finite-sample expressivity. In other words, a residual network with ReLU activations can easily express any function of a sample of size n, provided that it has sufficiently more than n parameters. Note that this requirement is easily met in practice. On CIFAR10 (n = 50000), for example, successful residual networks often have more than 10⁶ parameters. More formally, for a data set of size n with r classes, our construction requires O(n log n + r²) parameters. Theorem 3.2 gives the formal statement.

Each residual layer in our construction is of the form x + V ReLU(Ux), where U and V are linear transformations. These layers are significantly simpler than standard residual layers, which typically have two ReLU activations as well as two instances of batch normalization.

The power of all-convolutional residual networks. Directly inspired by the simplicity of our expressivity result, we experiment with a very similar architecture on the CIFAR10, CIFAR100, and ImageNet data sets. Our architecture is merely a chain of convolutional residual layers, each with a single ReLU activation, but without the batch normalization, dropout, or max pooling that are common in standard architectures.
The last layer is a fixed random projection that is not trained. In line with our theory, the convolutional weights are initialized near 0, using Gaussian noise mainly as a symmetry breaker. The only regularizer is standard weight decay (ℓ₂-regularization), and there is no need for dropout. Despite its simplicity, our architecture reaches 6.38% top-1 classification error on the CIFAR10 benchmark (with standard data augmentation). This is competitive with the best residual network reported in [6], which achieved 6.43%. Moreover, it improves upon the performance of the previous best all-convolutional network, 7.25%, achieved by [15]. Unlike ours, this previous all-convolutional architecture additionally required dropout and a non-standard preprocessing (ZCA) of the entire data set. Our architecture also improves significantly upon [15] on both CIFAR100 and ImageNet.

1.2 Related Work

Since the advent of residual networks ([6]; [7]), most state-of-the-art networks for image classification have adopted a residual parameterization of the convolutional layers. Further impressive improvements were reported by [8] with a variant of residual networks, called dense nets. Rather than adding the original input to the output of a convolutional layer, these networks preserve the original features directly by concatenation. In doing so, dense nets are also able to easily encode an identity embedding in a higher-dimensional space. It would be interesting to see if our theoretical results also apply to this variant of residual networks.

There has been recent progress on understanding the optimization landscape of neural networks, though a comprehensive answer remains elusive. Experiments in [5] and [4] suggest that the training objectives have a limited number of bad local minima with large function values.
Work by [3] draws an analogy between the optimization landscape of neural nets and that of the spin glass model in physics ([1]). [14] showed that 2-layer neural networks have no bad differentiable local minima, but did not prove that a good differentiable local minimum exists. [2] and [12] show that linear neural networks have no bad local minima. In contrast, we show that the optimization landscape of deep linear residual networks has no bad critical point, which is a stronger and more desirable property. Our proof is also notably simpler, illustrating the power of reparameterization for optimization. Our results also indicate that deeper networks may have more desirable optimization landscapes than shallower ones.

2 Optimization landscape of linear residual networks

Consider the problem of learning a linear transformation R : ℝ^d → ℝ^d from noisy measurements y = Rx + ξ, where ξ ∼ N(0, I_d) is a d-dimensional spherical Gaussian vector. Denoting by D the distribution of the input data x, let Σ = E_{x∼D}[xx^⊤] be its covariance matrix.

There are, of course, many ways to solve this classical problem, but our goal is to gain insight into the optimization landscape of neural nets, and in particular, residual networks. We therefore parameterize our learned model by a sequence of weight matrices A_1, …, A_ℓ ∈ ℝ^{d×d}:

    h_0 = x,   h_j = h_{j−1} + A_j h_{j−1},   ŷ = h_ℓ.    (2.1)

Here h_1, …, h_{ℓ−1} are the ℓ − 1 hidden layers and ŷ = h_ℓ are the predictions of the learned model on input x. More succinctly, we have ŷ = (I + A_ℓ) ⋯ (I + A_1) x. It is easy to see that this model can express any linear transformation R. We will use A as a shorthand for all of the weight matrices, that is, the ℓ × d × d-dimensional tensor that contains A_1, …, A_ℓ as slices. Our objective function is the maximum likelihood estimator,

    f(A, (x, y)) = ‖ŷ − y‖² = ‖(I + A_ℓ) ⋯ (I + A_1) x − Rx − ξ‖².    (2.2)

We will analyze the landscape of the population risk, defined as f(A) := E[f(A, (x, y))]. Recall that ‖A_i‖ is the spectral norm of A_i. We define the norm |||·||| for the tensor A as the maximum of the spectral norms of its slices,

    |||A||| := max_{1 ≤ i ≤ ℓ} ‖A_i‖.

The first theorem of this section states that the objective function f has an optimal solution with small |||·|||-norm, inversely proportional to the number of layers ℓ. Thus, when the architecture is deep, we can shoot for fairly small-norm solutions. We define

    γ := max{ |log σ_max(R)|, |log σ_min(R)| }.

Here σ_min(·), σ_max(·) denote the least and largest singular values of R, respectively.

Theorem 2.1. Suppose ℓ ≥ 3γ and det(R) > 0. Then there exists a global optimum solution A* of the population risk f(·) with norm |||A*||| ≤ (4π + 3γ)/ℓ.

We first note that the condition det(R) > 0 is without loss of generality in the following sense. Given any linear transformation R with negative determinant, we can effectively flip the determinant by augmenting the data and the label with an additional dimension: let x′ = [x, b] and y′ = [y, −b], where b is an independent random variable (say, from the standard normal distribution), and let

    R′ = [ R   0 ]
         [ 0  −1 ].

Then we have y′ = R′x′ + ξ and det(R′) = −det(R) > 0.²

Second, we note that here γ should be thought of as a constant, since if R is too large (or too small), we can scale the data so that σ_min(R) ≤ 1 ≤ σ_max(R). Concretely, if σ_max(R)/σ_min(R) = κ, then we can scale the outputs so that σ_min(R) = 1/√κ and σ_max(R) = √κ. In this case, we have γ = log √κ, which remains a small constant even for a fairly large condition number κ. We also point out that we made no attempt to optimize the constant factors in the analysis.
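As a sanity check on the model (2.1), the layer-by-layer recursion agrees with the product form ŷ = (I + A_ℓ) ⋯ (I + A_1) x. A minimal numpy sketch (our illustration; sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, ell = 4, 6
A = [0.1 * rng.standard_normal((d, d)) for _ in range(ell)]
x = rng.standard_normal(d)

# Recursion (2.1): h_j = h_{j-1} + A_j h_{j-1}.
h = x
for A_j in A:
    h = h + A_j @ h

# Product form: (I + A_ell) ... (I + A_1), accumulated from the right.
P = np.eye(d)
for A_j in A:
    P = (np.eye(d) + A_j) @ P

assert np.allclose(h, P @ x)
```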
The proof of Theorem 2.1 is rather involved and is deferred to Section A. Given the observation of Theorem 2.1, we restrict our attention to analyzing the landscape of f(·) on the set of A with |||·|||-norm at most τ,

    B_τ = { A ∈ ℝ^{ℓ×d×d} : |||A||| ≤ τ }.

By Theorem 2.1, the radius τ should be thought of as on the order of 1/ℓ. Our main theorem in this section claims that there is no bad critical point in the domain B_τ for any τ < 1. Recall that a critical point has vanishing gradient.

Theorem 2.2. For any τ < 1, any critical point A of the objective function f(·) inside the domain B_τ must also be a global minimum.

²When the dimension is odd, there is an easier way to see this: flipping the label corresponds to flipping R, and we have det(−R) = −det(R).

Theorem 2.2 suggests that it is sufficient for the optimizer to converge to critical points of the population risk, since all the critical points are also global minima. Moreover, in addition to Theorem 2.2, we also have that any A inside the domain B_τ satisfies

    ‖∇f(A)‖²_F ≥ 4ℓ(1 − τ)^{2ℓ−2} σ_min(Σ) (f(A) − C_opt).    (2.3)

Here C_opt is the global minimal value of f(·), ‖∇f(A)‖_F denotes the Euclidean norm³ of the ℓ × d × d-dimensional tensor ∇f(A), and σ_min(Σ) denotes the minimum singular value of Σ. Equation (2.3) says that the gradient has fairly large norm compared to the error, which guarantees convergence of gradient descent to a global minimum ([11]) if the iterates stay inside the domain B_τ, something that is not guaranteed by Theorem 2.2 by itself.

Towards proving Theorem 2.2, we start with a simple claim that simplifies the population risk. We use ‖·‖_F to denote the Frobenius norm of a matrix, and ⟨A, B⟩ denotes the inner product of A and B in the standard basis (that is, ⟨A, B⟩ = tr(A^⊤B), where tr(·) denotes the trace of a matrix).

Claim 2.3.
In the setting of this section, we have

    f(A) = ‖((I + A_ℓ) ⋯ (I + A_1) − R) Σ^{1/2}‖²_F + C.    (2.4)

Here C is a constant that does not depend on A, and Σ^{1/2} denotes the square root of Σ, that is, the unique symmetric positive semidefinite matrix B that satisfies B² = Σ.

Proof of Claim 2.3. Let tr(A) denote the trace of the matrix A. Let E = (I + A_ℓ) ⋯ (I + A_1) − R. Recalling the definition of f(A) and using equation (2.2), we have

    f(A) = E[‖Ex − ξ‖²]                          (by equation (2.2))
         = E[‖Ex‖² + ‖ξ‖² − 2⟨Ex, ξ⟩]
         = E[tr(E xx^⊤ E^⊤)] + E[‖ξ‖²]           (since E[⟨Ex, ξ⟩] = E[⟨Ex, E[ξ | x]⟩] = 0)
         = tr(E E[xx^⊤] E^⊤) + C                  (where C = E[‖ξ‖²])
         = tr(E Σ E^⊤) + C = ‖EΣ^{1/2}‖²_F + C.   (since E[xx^⊤] = Σ)

Next we compute the gradients of the objective function f(·) by straightforward matrix calculus. We defer the full proof to Section A.

Lemma 2.4. The gradients of f(·) can be written as

    ∂f/∂A_i = 2 (I + A_{i+1}^⊤) ⋯ (I + A_ℓ^⊤) E Σ (I + A_1^⊤) ⋯ (I + A_{i−1}^⊤),    (2.5)

where E = (I + A_ℓ) ⋯ (I + A_1) − R.

³That is, ‖T‖_F := √(Σ_{ijk} T²_{ijk}).

Now we are ready to prove Theorem 2.2. The key observation is that each matrix A_j has small norm and cannot cancel the identity matrix. Therefore, the gradient in equation (2.5) is a product of non-degenerate matrices, except for the error matrix E. Hence, if the gradient vanishes, the only possibility is that the matrix E vanishes, which in turn implies that A is an optimal solution.

Proof of Theorem 2.2. Using Lemma 2.4, we have

    ‖∂f/∂A_i‖_F = 2 ‖(I + A_{i+1}^⊤) ⋯ (I + A_ℓ^⊤) E Σ (I + A_1^⊤) ⋯ (I + A_{i−1}^⊤)‖_F    (by Lemma 2.4)
                ≥ 2 ∏_{j≠i} σ_min(I + A_j^⊤) · σ_min(Σ^{1/2}) ‖EΣ^{1/2}‖_F    (by Claim C.2)
                ≥ 2 (1 − τ)^{ℓ−1} σ_min(Σ^{1/2}) ‖EΣ^{1/2}‖_F.
    (since σ_min(I + A) ≥ 1 − ‖A‖)

It follows that

    ‖∇f(A)‖²_F = Σ_{i=1}^ℓ ‖∂f/∂A_i‖²_F
               ≥ 4ℓ(1 − τ)^{2(ℓ−1)} σ_min(Σ) ‖EΣ^{1/2}‖²_F
               = 4ℓ(1 − τ)^{2(ℓ−1)} σ_min(Σ) (f(A) − C)        (by the definition of E and Claim 2.3)
               ≥ 4ℓ(1 − τ)^{2(ℓ−1)} σ_min(Σ) (f(A) − C_opt).   (since C_opt := min_A f(A) ≥ C by Claim 2.3)

This completes the proof of equation (2.3). Finally, if A is a critical point, namely ∇f(A) = 0, then by equation (2.3) we have f(A) = C_opt. That is, A is a global minimum.

3 Representational Power of Residual Networks

In this section we characterize the finite-sample expressivity of residual networks. We consider residual layers with a single ReLU activation and no batch normalization. The basic residual building block is a function T_{U,V,s}(·) : ℝ^k → ℝ^k, parameterized by two weight matrices U ∈ ℝ^{k×k}, V ∈ ℝ^{k×k} and a bias vector s ∈ ℝ^k:

    T_{U,V,s}(h) = V ReLU(Uh + s).    (3.1)

A residual network is composed of a sequence of such residual blocks. In comparison with the full pre-activation architecture in [7], we remove two batch normalization layers and one ReLU layer in each building block.

We assume the data has r labels, encoded as the r standard basis vectors in ℝ^r, denoted by e_1, …, e_r. We have n training examples (x^(1), y^(1)), …, (x^(n), y^(n)), where x^(i) ∈ ℝ^d denotes the i-th data point and y^(i) ∈ {e_1, …, e_r} denotes the i-th label. Without loss of generality, we assume the data are normalized so that ‖x^(i)‖ = 1. We also make the mild assumption that no two data points are very close to each other.

Assumption 3.1. We assume that for every 1 ≤ i < j ≤ n, we have ‖x^(i) − x^(j)‖₂ ≥ ρ for some absolute constant ρ > 0.

Images, for example, can always be imperceptibly perturbed in pixel space so as to satisfy this assumption for a small but constant ρ.
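The building block (3.1) takes only a few lines of code. A minimal numpy sketch (ours, not the authors' code; dimensions arbitrary), which also checks the identity-at-zero property that motivates the residual parameterization:

```python
import numpy as np

def T(h, U, V, s):
    # The residual building block of equation (3.1): T_{U,V,s}(h) = V ReLU(U h + s).
    return V @ np.maximum(U @ h + s, 0.0)

def residual_block(h, U, V, s):
    # One layer of the construction: h + T_{U,V,s}(h).
    return h + T(h, U, V, s)

k = 4
h = np.arange(k, dtype=float)
Z = np.zeros((k, k))

# With all weights and the bias at zero, the block is exactly the identity map.
assert np.allclose(residual_block(h, Z, Z, np.zeros(k)), h)
```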
Under this mild assumption, we prove that residual networks have the power to express any possible labeling of the data, as long as the number of parameters is a logarithmic factor larger than n.

Theorem 3.2. Suppose the training examples satisfy Assumption 3.1. Then there exists a residual network N (specified below) with O(n log n + r²) parameters that perfectly expresses the training data, i.e., for all i ∈ {1, …, n}, the network N maps x^(i) to y^(i).

It is common in practice that n > r², as is for example the case for the ImageNet data set, where n > 10⁶ and r = 1000.

We construct the following residual net using the building blocks of the form T_{U,V,s} as defined in equation (3.1). The network consists of ℓ + 1 hidden layers h_0, …, h_ℓ, and the output is denoted by ŷ ∈ ℝ^r. The first layer of weight matrices, A_0, maps the d-dimensional input to a k-dimensional hidden variable h_0. Then we apply ℓ layers of the building block T with weight matrices A_j, B_j ∈ ℝ^{k×k}. Finally, we apply another layer to map the hidden variable h_ℓ to the label ŷ in ℝ^r. Mathematically, we have

    h_0 = A_0 x,
    h_j = h_{j−1} + T_{A_j,B_j,b_j}(h_{j−1}),   ∀ j ∈ {1, …, ℓ},
    ŷ  = T_{A_{ℓ+1},B_{ℓ+1},s_{ℓ+1}}(h_ℓ).

We note that here A_{ℓ+1} ∈ ℝ^{k×r} and B_{ℓ+1} ∈ ℝ^{r×r}, so that the dimensions are compatible. We assume the number of labels r and the input dimension d are both smaller than n, which is safely true in practical applications.⁴ The hyperparameter k will be chosen to be O(log n), and the number of layers is chosen to be ℓ = ⌈n/k⌉. Thus, the first layer has dk parameters, each of the ℓ middle building blocks contains 2k² parameters, and the final building block has kr + r² parameters. Hence, the total number of parameters is O(kd + ℓk² + rk + r²) = O(n log n + r²).
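The parameter count above can be reproduced with a few lines of arithmetic. In this sketch (ours), the CIFAR10-like sizes and the base-2 logarithm for k are illustrative choices, and bias vectors are omitted from the count as in the paper's tally:

```python
import math

n, d, r = 50000, 3072, 10               # illustrative CIFAR10-like sizes
k = math.ceil(math.log2(n))             # hidden width k = O(log n)
ell = math.ceil(n / k)                  # number of middle residual blocks

# First layer dk, each middle block 2k^2, final block kr + r^2.
total = d * k + ell * 2 * k**2 + k * r + r**2
print(k, ell, total)                    # total is O(n log n + r^2)
```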
Towards constructing a network N of the form above that fits the data, we first take a random matrix A_0 ∈ ℝ^{k×d} that maps all the data points x^(i) to vectors h_0^(i) := A_0 x^(i). Here we will use h_j^(i) to denote the j-th layer hidden variable of the i-th example. By the Johnson–Lindenstrauss theorem ([10]; or see [17]), with good probability the resulting vectors h_0^(i) still satisfy Assumption 3.1 (with slightly different scaling and a larger constant ρ); that is, no two vectors h_0^(i) and h_0^(j) are very correlated.

Then we construct ℓ middle layers that map h_0^(i) to h_ℓ^(i) for every i ∈ {1, …, n}. These vectors h_ℓ^(i) will be clustered into r groups according to the labels, though they live in ℝ^k instead of ℝ^r as desired. Concretely, we design these cluster centers by picking r random unit vectors q_1, …, q_r in ℝ^k. We view them as surrogate label vectors in dimension k (note that k is potentially much smaller than r). In high dimensions (technically, if k > 4 log r), random unit vectors q_1, …, q_r are pairwise uncorrelated, with inner products less than 0.5. We associate the i-th example with the target surrogate label vector v^(i) defined as follows:

    if y^(i) = e_j, then v^(i) = q_j.    (3.2)

⁴In computer vision, typically r is less than 10³ and d is less than 10⁵, while n is larger than 10⁶.

Then we will construct the matrices (A_1, B_1), …, (A_ℓ, B_ℓ) such that the first ℓ layers of the network map each vector h_0^(i) to the surrogate label vector v^(i). Mathematically, we will construct (A_1, B_1), …, (A_ℓ, B_ℓ) such that

    ∀ i ∈ {1, …, n}:  h_ℓ^(i) = v^(i).    (3.3)

Finally, we will construct the last layer T_{A_{ℓ+1},B_{ℓ+1},b_{ℓ+1}} so that it maps the vectors q_1, …, q_r ∈ ℝ^k to e_1, …, e_r ∈ ℝ^r:

    ∀ j ∈ {1, …, r}:  T_{A_{ℓ+1},B_{ℓ+1},b_{ℓ+1}}(q_j) = e_j.    (3.4)

Putting these together, by definition (3.2) and equation (3.3), for every i, if the label y^(i) is e_j, then h_ℓ^(i) will be q_j. Then by equation (3.4), we have ŷ^(i) = T_{A_{ℓ+1},B_{ℓ+1},b_{ℓ+1}}(q_j) = e_j. Hence we obtain ŷ^(i) = y^(i).

The key part of this plan is the construction of the middle ℓ layers of weight matrices so that h_ℓ^(i) = v^(i). We encapsulate this in the following informal lemma. The formal statement and the full proof are deferred to Section B.

Lemma 3.3 (Informal version of Lemma B.2). In the setting above, for (almost) arbitrary vectors h_0^(1), …, h_0^(n) and v^(1), …, v^(n) ∈ {q_1, …, q_r}, there exist weight matrices (A_1, B_1), …, (A_ℓ, B_ℓ) such that

    ∀ i ∈ {1, …, n}:  h_ℓ^(i) = v^(i).

We briefly sketch the proof of the lemma to provide intuition, and defer the full proof to Section B. The operation that each residual block applies to the hidden variable can be abstractly written as

    ĥ = h + T_{U,V,s}(h),    (3.5)

where h corresponds to the hidden variable before the block and ĥ to that after. We claim that for an (almost) arbitrary sequence of vectors h^(1), …, h^(n), there exists T_{U,V,s}(·) such that operation (3.5) transforms k of the h^(i)'s to an arbitrary set of k other vectors that we can freely choose, while maintaining the values of the remaining n − k vectors. Concretely, for any subset S of size k and any desired vectors v^(i) (i ∈ S), there exist U, V, s such that

    v^(i) = h^(i) + T_{U,V,s}(h^(i))   ∀ i ∈ S,
    h^(i) = h^(i) + T_{U,V,s}(h^(i))   ∀ i ∉ S.    (3.6)

This claim is formalized in Lemma B.1. We can use it repeatedly to construct ℓ layers of building blocks, each of which transforms a subset of k vectors in {h_0^(1), …, h_0^(n)} to the corresponding vectors in {v^(1), …, v^(n)} and maintains the values of the others.
Recall that we have ℓ = ⌈n/k⌉ layers; therefore, after ℓ layers, all the vectors h_0^(i) have been transformed to the v^(i), which completes the proof sketch.

4 Power of all-convolutional residual networks

Inspired by our theory, we experimented with all-convolutional residual networks on standard image classification benchmarks.

4.1 CIFAR10 and CIFAR100

Our architectures for CIFAR10 and CIFAR100 are identical except for the final dimension corresponding to the number of classes, 10 and 100, respectively. In Table 1, we outline our architecture. Each residual block has the form x + C₂(ReLU(C₁x)), where C₁, C₂ are convolutions of the specified dimension (kernel width, kernel height, number of input channels, number of output channels). The second convolution in each block always has stride 1, while the first may have stride 2 where indicated. In cases where the transformation is not dimensionality-preserving, the original input x is adjusted using average pooling and padding, as is standard in residual layers.

We trained our models with the TensorFlow framework, using a momentum optimizer with momentum 0.9 and batch size 128. All convolutional weights are trained with weight decay 0.0001. The initial learning rate is 0.05, which drops by a factor of 10 at 30000 and at 50000 steps. The model reaches peak performance at around 50k steps, which takes about 24h on a single NVIDIA Tesla K40 GPU. Our code can easily be derived from an open source implementation⁵ by removing batch normalization and adjusting the residual components and model architecture. An important departure from that code is that we initialize a residual convolutional layer of kernel size k × k and c output channels using a random normal initializer of standard deviation σ = 1/(k²c), rather than the 1/(k√c) used for standard convolutional layers. This substantially smaller weight initialization helped training, while not affecting representation.
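The two training details above, the non-standard initializer σ = 1/(k²c) and the step-wise learning-rate drops, can be sketched as follows. This is our reading of the text, not the authors' code; the (height, width, in-channels, out-channels) tensor layout is an assumption:

```python
import numpy as np

def residual_conv_init(k, c_in, c_out, rng):
    # Initializer described above: N(0, sigma^2) with sigma = 1/(k^2 * c),
    # much smaller than the usual 1/(k*sqrt(c)) for standard conv layers.
    sigma = 1.0 / (k**2 * c_out)
    return rng.normal(0.0, sigma, size=(k, k, c_in, c_out))

def learning_rate(step, base=0.05):
    # Initial rate 0.05, dropping by a factor of 10 at 30000 and 50000 steps.
    if step >= 50000:
        return base / 100
    if step >= 30000:
        return base / 10
    return base

rng = np.random.default_rng(0)
W = residual_conv_init(3, 16, 64, rng)
print(W.std())               # close to 1/(3^2 * 64), about 0.0017
print(learning_rate(40000))  # mid-schedule rate, base/10
```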
A notable difference from standard models is that the last layer is not trained, but is simply a fixed random projection. On the one hand, this slightly improved test error (perhaps due to a regularizing effect). On the other hand, it means that the only trainable weights in our model are those of the convolutions, making our architecture "all-convolutional".

An interesting aspect of our model is that, despite its massive size of 13.59 million trainable parameters, the model does not seem to overfit too quickly, even though the data set size is 50000. In contrast, we found it difficult to train a model of this size with batch normalization without significant overfitting on CIFAR10.

Table 2 summarizes the top-1 classification error of our models compared with a non-exhaustive list of previous works, restricted to the best previous all-convolutional result by [15], the first residual results [6], and state-of-the-art results on CIFAR by [8]. All results are with standard data augmentation.

⁵https://github.com/tensorflow/models/tree/master/resnet

Table 1: Architecture for CIFAR10/100 (55 convolutions, 13.5M parameters)

    variable dimensions     initial stride   description
    3 × 3 × 3 × 16          1                1 standard conv
    3 × 3 × 16 × 64         1                9 residual blocks
    3 × 3 × 64 × 128        2                9 residual blocks
    3 × 3 × 128 × 256       2                9 residual blocks
    –                       –                8 × 8 global average pool
    256 × num_classes       –                random projection (not trained)

Figure 1: Convergence plots of the best model for CIFAR10 (left) and CIFAR100 (right). One step is a gradient update with batch size 128. (Both panels plot precision against steps ×1000, with train, test, and min curves.)

4.2 ImageNet

The ImageNet ILSVRC 2012 data set has 1,281,167 data points with 1000 classes. Each image is resized to 224 × 224 pixels with 3 channels.
We experimented with an all-convolutional variant of the 34-layer network in [6]. The original model achieved 25.03% classification error. Our derived model has 35.7M trainable parameters. We trained the model with a momentum optimizer (with momentum 0.9) and a learning rate schedule that decays by a factor of 0.94 every two epochs, starting from an initial learning rate of 0.1. Training was distributed across 6 machines updating asynchronously. Each machine was equipped with 8 GPUs (NVIDIA Tesla K40) and used batch size 256 split across the 8 GPUs, so that each GPU updated with batches of size 32.

In contrast to the situation with CIFAR10 and CIFAR100, on ImageNet our all-convolutional model performed significantly worse than its original counterpart. Specifically, we experienced a significant amount of underfitting, suggesting that a larger model would likely perform better.

Table 2: Comparison of top-1 classification error on different benchmarks

    Method     CIFAR10   CIFAR100   ImageNet   remarks
    All-CNN    7.25      32.39      41.2       all-convolutional, dropout, extra data processing
    Ours       6.38      24.64      35.29      all-convolutional
    ResNet     6.43      25.16      19.38
    DenseNet   3.74      19.25      N/A

Despite this issue, our model still reached 35.29% top-1 classification error on the test set (50000 data points), and 14.17% top-5 test error after 700,000 steps (about one week of training). While no longer state-of-the-art, this performance is significantly better than the 40.7% reported by [13], as well as the best all-convolutional architecture by [15]. We believe it quite likely that a better learning rate schedule and better hyperparameter settings for our model could substantially improve on the preliminary performance reported here.

5 Conclusion

Our theory underlines the importance of identity parameterizations when training deep artificial neural networks.
An outstanding open problem is to extend our optimization result to the non-linear case, where each residual layer has a single ReLU activation, as in our expressivity result. We conjecture that a result analogous to Theorem 2.2 is true for the general non-linear case. Unlike with the standard parameterization, we see no fundamental obstacle to such a result.

We hope our theory and experiments together help simplify the state of deep learning by aiming to explain its success with a few fundamental principles, rather than a multitude of tricks that need to be delicately combined. We believe that much of the advances in image recognition can be achieved with residual convolutional layers and ReLU activations alone. This could lead to extremely simple (albeit deep) architectures that match the state-of-the-art on all image classification benchmarks.

Acknowledgment: We thank Jason D. Lee, Qixing Huang, and Jonathan Shewchuk for helpful discussions and for kindly pointing out errors in earlier versions of the paper. We also thank Jonathan Shewchuk for suggesting an improvement of equation (2.3) that is incorporated into the current version. Tengyu Ma would like to acknowledge the support of the Dodds Fellowship and the Siebel Scholarship.

References

[1] Antonio Auffinger, Gérard Ben Arous, and Jiří Černý. Random matrices and complexity of spin glasses. Communications on Pure and Applied Mathematics, 66(2):165–201, 2013.

[2] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, January 1989.

[3] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[4] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

[5] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint, 2015.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision – ECCV 2016, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pages 630–645, 2016.

[8] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.

[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pages 448–456, 2015.

[10] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189–206):1, 1984.

[11] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. ArXiv e-prints, August 2016.

[12] K. Kawaguchi. Deep learning without poor local minima. ArXiv e-prints, May 2016.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[14] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. ArXiv e-prints, May 2016.

[15] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. ArXiv e-prints, December 2014.

[16] Eric W.
Weisstein. Normal matrix. From MathWorld – A Wolfram Web Resource, 2016.

[17] Wikipedia. Johnson–Lindenstrauss lemma — Wikipedia, The Free Encyclopedia, 2016.

A Missing Proofs in Section 2

In this section, we give the complete proofs of Theorem 2.1 and Lemma 2.4, which were omitted from Section 2.

A.1 Proof of Theorem 2.1

It turns out the proof is significantly easier if $R$ is assumed to be a symmetric positive semidefinite (PSD) matrix, or if we allow the variables to be complex matrices. We first give a proof sketch for the first special case; the reader may skip it and jump to the full proof below. For this special case we also prove the stronger bound $|||A^\star||| \le 3\gamma/\ell$.

When $R$ is PSD, it can be diagonalized by an orthonormal matrix $U$ in the sense that $R = UZU^\top$, where $Z = \mathrm{diag}(z_1, \dots, z_d)$ is a diagonal matrix with non-negative diagonal entries $z_1, \dots, z_d$. Let $A^\star_1 = \cdots = A^\star_\ell = U\,\mathrm{diag}(z_i^{1/\ell})\,U^\top - I$. Then we have
\[
(I + A^\star_\ell)\cdots(I + A^\star_1) = \left(U\,\mathrm{diag}(z_i^{1/\ell})\,U^\top\right)^\ell = U\,\mathrm{diag}(z_i^{1/\ell})^\ell\,U^\top = UZU^\top = R,
\]
where the second equality uses $U^\top U = I$. We see that the network defined by $A^\star$ reconstructs the transformation $R$, and it is therefore a global minimum of the population risk (formally, see Claim 2.3). Next, we verify that each $A^\star_j$ has small spectral norm:
\[
\|A^\star_j\| = \left\|I - U\,\mathrm{diag}(z_i^{1/\ell})\,U^\top\right\| = \left\|U\left(I - \mathrm{diag}(z_i^{1/\ell})\right)U^\top\right\| = \left\|I - \mathrm{diag}(z_i^{1/\ell})\right\| = \max_i |z_i^{1/\ell} - 1|, \quad (A.1)
\]
where the third equality uses the fact that $U$ is orthonormal. Since $\sigma_{\min}(R) \le z_i \le \sigma_{\max}(R)$, we have $\ell \ge 3\gamma \ge |\log z_i|$. It follows that
\[
|z_i^{1/\ell} - 1| = |e^{(\log z_i)/\ell} - 1| \le 3\,|(\log z_i)/\ell| \le 3\gamma/\ell,
\]
using $|e^x - 1| \le 3|x|$ for all $|x| \le 1$. Combining this with equation (A.1), we have $|||A^\star||| \le \max_j \|A^\star_j\| \le 3\gamma/\ell$, which completes the proof for the special case.

Towards fully proving Theorem 2.1, we start with the following claim:

Claim A.1.
Suppose $Q \in \mathbb{R}^{2\times 2}$ is an orthonormal matrix. Then for any integer $q$, there exist matrices $W_1, \dots, W_q \in \mathbb{R}^{2\times 2}$ and a diagonal matrix $\Lambda$ such that (a) $Q = W_1 \cdots W_q \Lambda$ and $\|W_j - I\| \le \pi/q$; (b) $\Lambda$ is a diagonal matrix with $\pm 1$ entries on the diagonal; and (c) if $Q$ is a rotation, then $\Lambda = I$.

Proof. We first consider the case when $Q$ is a rotation. Every rotation matrix can be written as
\[
T(\theta) := \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.
\]
Suppose $Q = T(\theta)$. Then we can take $W_1 = \cdots = W_q = T(\theta/q)$ and $\Lambda = I$, and one can verify that $\|W_j - I\| \le |\theta|/q \le \pi/q$.

Next, we consider the case when $Q$ is a reflection. Then $Q$ can be written as $Q = T(\theta)\cdot\mathrm{diag}(-1, 1)$, where $\mathrm{diag}(-1, 1)$ is the reflection with respect to the $y$-axis. We take $W_1 = \cdots = W_q = T(\theta/q)$ and $\Lambda = \mathrm{diag}(-1, 1)$, which completes the proof.

Next we give the formal full proof of Theorem 2.1. The main idea is to reduce to the block-diagonal situation and apply the claim above.

Proof of Theorem 2.1. Let $R = UKV^\top$ be the singular value decomposition of $R$, where $U, V$ are orthonormal matrices and $K$ is a diagonal matrix with non-negative diagonal entries. Since $\det(R) = \det(U)\det(K)\det(V) > 0$ and $\det(K) > 0$, we can flip the signs of $U$ and $V$ appropriately so that $\det(U) = \det(V) = 1$. Since $U$ is a normal matrix (that is, $UU^\top = U^\top U$), by Claim C.1, $U$ can be block-diagonalized by an orthonormal matrix $S$ as $U = SDS^{-1}$, where $D = \mathrm{diag}(D_1, \dots, D_m)$ is a real block-diagonal matrix with each block $D_i$ of size at most $2\times 2$. Using Claim A.1, for each $D_i$ there exist $W_{i,1}, \dots, W_{i,q}$ and $\Lambda_i$ such that
\[
D_i = W_{i,1}\cdots W_{i,q}\Lambda_i \quad (A.2)
\]
and $\|W_{i,j} - I\| \le \pi/q$. Let $\Lambda = \mathrm{diag}(\Lambda_1, \dots, \Lambda_m)$ and $W_j = \mathrm{diag}(W_{1,j}, \dots, W_{m,j})$. We can rewrite equation (A.2) as
\[
D = W_1 \cdots W_q \Lambda. \quad (A.3)
\]
Moreover, $\Lambda$ is a diagonal matrix with $\pm 1$ entries on the diagonal.
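As an illustrative numerical check of the factorization in Claim A.1 (a sketch we add for exposition, not part of the original argument), the following verifies that a rotation $T(\theta)$ is the product of $q$ near-identity rotations $T(\theta/q)$, with the reflection case handled by the extra factor $\Lambda = \mathrm{diag}(-1, 1)$:

```python
import numpy as np

def T(theta):
    """2x2 rotation matrix T(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta, q = 2.0, 10          # any theta in [-pi, pi] and number of factors q
W = T(theta / q)            # W_1 = ... = W_q from the proof of Claim A.1

# The product of the q small rotations recovers Q = T(theta).
assert np.allclose(np.linalg.matrix_power(W, q), T(theta))

# Each factor is close to the identity: ||W_j - I|| <= theta/q <= pi/q.
assert np.linalg.norm(W - np.eye(2), 2) <= theta / q + 1e-12

# A reflection Q = T(theta) @ diag(-1, 1) uses the same factors with Lambda = diag(-1, 1).
Lam = np.diag([-1.0, 1.0])
assert np.allclose(np.linalg.matrix_power(W, q) @ Lam, T(theta) @ Lam)
```

The spectral norm of $T(\theta/q) - I$ is $2\sin(\theta/2q)$, which is indeed at most $\theta/q$.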
Since the $W_{i,j}$'s are orthonormal matrices with determinant 1, we have $\det(\Lambda) = \det(D) = \det(U) = 1$. That is, $\Lambda$ has an even number of $-1$'s on the diagonal, so we can group the $-1$'s into $2\times 2$ blocks. Note that
\[
\begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}
\]
is the rotation matrix $T(\pi)$. Thus we can write $\Lambda$ as a block-diagonal matrix consisting of $+1$ entries on the diagonal and blocks $T(\pi)$. Applying Claim A.1 to each of the blocks $T(\pi)$, we obtain $W'_1, \dots, W'_q$ such that
\[
\Lambda = W'_1 \cdots W'_q \quad (A.4)
\]
where $\|W'_j - I\| \le \pi/q$. Thus, using equations (A.3) and (A.4), we obtain
\[
U = SDS^{-1} = SW_1S^{-1}\cdots SW_qS^{-1}\cdot SW'_1S^{-1}\cdots SW'_qS^{-1}.
\]
Moreover, for every $j$,
\[
\|SW_jS^{-1} - I\| = \|S(W_j - I)S^{-1}\| = \|W_j - I\| \le \pi/q,
\]
because $S$ is an orthonormal matrix. The same can be proved for the $W'_j$. Thus, letting $B_j = SW_jS^{-1} - I$ for $j \le q$ and $B_{j+q} = SW'_jS^{-1} - I$, we can rewrite
\[
U = (I + B_1)\cdots(I + B_{2q}).
\]
We deal with $V$ similarly, decomposing $V^\top$ into $2q$ matrices that are $\pi/q$-close to the identity matrix:
\[
V^\top = (I + B'_1)\cdots(I + B'_{2q}).
\]
Last, we deal with the diagonal matrix $K$. Let $K = \mathrm{diag}(k_i)$. We have $\min_i k_i = \sigma_{\min}(R)$ and $\max_i k_i = \sigma_{\max}(R)$. We write $K = (K')^p$, where $K' = \mathrm{diag}(k_i^{1/p})$ and $p$ is an integer to be chosen later. We have
\[
\|K' - I\| \le \max_i |k_i^{1/p} - 1| = \max_i |e^{(\log k_i)/p} - 1|.
\]
When $p \ge \gamma = \max\{\log \max_i k_i,\ -\log \min_i k_i\} = \max\{\log \sigma_{\max}(R),\ -\log \sigma_{\min}(R)\}$, we have
\[
\|K' - I\| \le \max_i |e^{(\log k_i)/p} - 1| \le 3\max_i |(\log k_i)/p| = 3\gamma/p,
\]
using $|e^x - 1| \le 3|x|$ for $|x| \le 1$. Let $B''_1 = \cdots = B''_p = K' - I$, so that $K = (I + B''_p)\cdots(I + B''_1)$. Finally, we choose $p = \frac{3\gamma\ell}{4\pi + 3\gamma}$ and $q = \frac{\pi\ell}{4\pi + 3\gamma}$ (see footnote 6), and let
\[
A_{p+4q} = B_{2q},\ \dots,\ A_{p+2q+1} = B_1, \quad A_{p+2q} = B''_p,\ \dots,\ A_{2q+1} = B''_1, \quad A_{2q} = B'_{2q},\ \dots,\ A_1 = B'_1.
\]
We have $4q + p = \ell$ and
\[
R = UKV^\top = (I + A_\ell)\cdots(I + A_1).
\]
Moreover,
\[
|||A||| \le \max_j\{\|B_j\|, \|B'_j\|, \|B''_j\|\} \le \max\{\pi/q,\ 3\gamma/p\} \le \frac{4\pi + 3\gamma}{\ell},
\]
as desired.

(Footnote 6: For notational convenience, $p$ and $q$ are not chosen to be integers here; rounding them to the closest integers changes the final norm bound by at most a small constant factor.)

A.2 Proof of Lemma 2.4

We compute the partial gradients by definition. Let $\Delta_j \in \mathbb{R}^{d\times d}$ be an infinitesimal change to $A_j$. Using Claim 2.3, consider the Taylor expansion of $f(A_1, \dots, A_j + \Delta_j, \dots, A_\ell)$:
\[
\begin{aligned}
f(A_1, \dots, A_j + \Delta_j, \dots, A_\ell)
&= \left\|\left((I + A_\ell)\cdots(I + A_j + \Delta_j)\cdots(I + A_1) - R\right)\Sigma^{1/2}\right\|_F^2 \\
&= \left\|\left((I + A_\ell)\cdots(I + A_1) - R\right)\Sigma^{1/2} + (I + A_\ell)\cdots\Delta_j\cdots(I + A_1)\Sigma^{1/2}\right\|_F^2 \\
&= \left\|\left((I + A_\ell)\cdots(I + A_1) - R\right)\Sigma^{1/2}\right\|_F^2 \\
&\quad + 2\left\langle \left((I + A_\ell)\cdots(I + A_1) - R\right)\Sigma^{1/2},\ (I + A_\ell)\cdots\Delta_j\cdots(I + A_1)\Sigma^{1/2}\right\rangle + O(\|\Delta_j\|_F^2) \\
&= f(A) + 2\left\langle (I + A_{j+1}^\top)\cdots(I + A_\ell^\top)\,E\Sigma\,(I + A_1^\top)\cdots(I + A_{j-1}^\top),\ \Delta_j\right\rangle + O(\|\Delta_j\|_F^2).
\end{aligned}
\]
By definition, this means that
\[
\frac{\partial f}{\partial A_j} = 2(I + A_{j+1}^\top)\cdots(I + A_\ell^\top)\,E\Sigma\,(I + A_1^\top)\cdots(I + A_{j-1}^\top).
\]

B Missing Proofs in Section 3

In this section, we provide the full proof of Theorem 3.2. We start with the following lemma, which constructs a building block $T$ that transforms $k$ vectors of an arbitrary sequence of $n$ vectors to arbitrary target values, while maintaining the values of the others. For better abstraction we use $\alpha^{(i)}, \beta^{(i)}$ to denote the sequences of vectors.

Lemma B.1. Let $S \subset [n]$ be of size $k$. Suppose $\alpha^{(1)}, \dots, \alpha^{(n)}$ is a sequence of $n$ vectors satisfying (a) for every $1 \le i \le n$, we have $1 - \rho_0 \le \|\alpha^{(i)}\|^2 \le 1 + \rho_0$, and (b) if $i \ne j$ and $S$ contains at least one of $i, j$, then $\|\alpha^{(i)} - \alpha^{(j)}\|^2 \ge 6\rho_0$. Let $\beta^{(1)}, \dots, \beta^{(n)}$ be an arbitrary sequence of vectors.
Then there exist $U, V \in \mathbb{R}^{k\times k}$ and $s$ such that for every $i \in S$ we have $T_{U,V,s}(\alpha^{(i)}) = \beta^{(i)} - \alpha^{(i)}$, and moreover, for every $i \in [n]\setminus S$ we have $T_{U,V,s}(\alpha^{(i)}) = 0$.

We can see that the conclusion implies
\[
\beta^{(i)} = \alpha^{(i)} + T_{U,V,s}(\alpha^{(i)}) \quad \forall\, i \in S, \qquad
\alpha^{(i)} = \alpha^{(i)} + T_{U,V,s}(\alpha^{(i)}) \quad \forall\, i \notin S,
\]
which is a different way of writing equation (3.6).

Proof of Lemma B.1. Without loss of generality, suppose $S = \{1, \dots, k\}$. We construct $U, V, s$ as follows. Let the $i$-th row of $U$ be $\alpha^{(i)}$ for $i \in [k]$, and let $s = -(1 - 2\rho_0)\cdot\mathbf{1}$, where $\mathbf{1}$ denotes the all-1's vector. Let the $i$-th column of $V$ be
\[
\frac{1}{\|\alpha^{(i)}\|^2 - (1 - 2\rho_0)}\left(\beta^{(i)} - \alpha^{(i)}\right) \quad \text{for } i \in [k].
\]
Next we verify the correctness of the construction. First consider $1 \le i \le k$. The $i$-th coordinate of $U\alpha^{(i)}$ equals $\|\alpha^{(i)}\|^2 \ge 1 - \rho_0$. For $j \ne i$, the $j$-th coordinate of $U\alpha^{(i)}$ equals $\langle \alpha^{(j)}, \alpha^{(i)}\rangle$, which can be upper-bounded using the assumptions of the lemma:
\[
\langle \alpha^{(j)}, \alpha^{(i)}\rangle = \frac{1}{2}\left(\|\alpha^{(i)}\|^2 + \|\alpha^{(j)}\|^2\right) - \frac{1}{2}\|\alpha^{(i)} - \alpha^{(j)}\|^2 \le 1 + \rho_0 - 3\rho_0 \le 1 - 2\rho_0. \quad (B.1)
\]
Therefore, $U\alpha^{(i)} - (1 - 2\rho_0)\cdot\mathbf{1}$ contains a single positive entry (with value at least $\|\alpha^{(i)}\|^2 - (1 - 2\rho_0) \ge \rho_0$), with all other entries non-positive. This means that
\[
\mathrm{ReLU}(U\alpha^{(i)} + s) = \left(\|\alpha^{(i)}\|^2 - (1 - 2\rho_0)\right)e_i,
\]
where $e_i$ is the $i$-th standard basis vector. It follows that
\[
V\,\mathrm{ReLU}(U\alpha^{(i)} + s) = \left(\|\alpha^{(i)}\|^2 - (1 - 2\rho_0)\right)Ve_i = \beta^{(i)} - \alpha^{(i)}.
\]
Finally, consider $n \ge i > k$. Similarly to the computation in equation (B.1), $U\alpha^{(i)}$ is a vector with all coordinates at most $1 - 2\rho_0$. Therefore $U\alpha^{(i)} + s$ has only non-positive entries. Hence $\mathrm{ReLU}(U\alpha^{(i)} + s) = 0$, which implies $V\,\mathrm{ReLU}(U\alpha^{(i)} + s) = 0$.

Now we are ready to state the formal version of Lemma 3.3.

Lemma B.2. Suppose a sequence of $n$ vectors $z^{(1)}, \dots$
, $z^{(n)}$ satisfies a relaxed version of Assumption 3.1: (a) for every $i$, $1 - \rho_0 \le \|z^{(i)}\|^2 \le 1 + \rho_0$, and (b) for every $i \ne j$, $\|z^{(i)} - z^{(j)}\|^2 \ge \rho_0$. Let $v^{(1)}, \dots, v^{(n)}$ be defined as above. Then there exist weight matrices $(A_1, B_1), \dots, (A_\ell, B_\ell)$ such that, given $h^{(i)}_0 = z^{(i)}$ for all $i$, we have $h^{(i)}_\ell = v^{(i)}$ for all $i \in \{1, \dots, n\}$.

We will use Lemma B.1 repeatedly to construct the building blocks $T_{A_j, B_j, s_j}(\cdot)$, and thus prove Lemma B.2. Each building block $T_{A_j, B_j, s_j}(\cdot)$ takes a subset of $k$ vectors among $\{z^{(1)}, \dots, z^{(n)}\}$ and converts them to the corresponding $v^{(i)}$'s, while leaving all other vectors fixed. Since there are $n/k$ layers in total, we eventually map all of the $z^{(i)}$'s to the target vectors $v^{(i)}$'s.

Proof of Lemma B.2. We use Lemma B.1 repeatedly. Let $S_1 = \{1, \dots, k\}$. Using Lemma B.1 with $\alpha^{(i)} = z^{(i)}$ and $\beta^{(i)} = v^{(i)}$ for $i \in [n]$, we obtain $A_1, B_1, b_1$ such that for $i \le k$ it holds that $h^{(i)}_1 = z^{(i)} + T_{A_1, B_1, b_1}(z^{(i)}) = v^{(i)}$, and for $i > k$ it holds that $h^{(i)}_1 = z^{(i)} + T_{A_1, B_1, b_1}(z^{(i)}) = z^{(i)}$.

Now we construct the other layers inductively. We will construct the layers such that the hidden variables at layer $j$ satisfy $h^{(i)}_j = v^{(i)}$ for every $1 \le i \le jk$, and $h^{(i)}_j = z^{(i)}$ for every $n \ge i > jk$. Assume that we have constructed the first $j$ layers; next we use Lemma B.1 to construct layer $j + 1$. We argue that the choice of $\alpha^{(1)} = v^{(1)}, \dots, \alpha^{(jk)} = v^{(jk)}, \alpha^{(jk+1)} = z^{(jk+1)}, \dots, \alpha^{(n)} = z^{(n)}$ and $S = \{jk + 1, \dots, (j+1)k\}$ satisfies the assumptions of Lemma B.1. Indeed, because the $q_i$'s are chosen uniformly at random, with high probability we have $\langle q_s, z^{(i)}\rangle \le 1 - \rho_0$ for every $s$ and $i$. Thus, since $v^{(i)} \in \{q_1, \dots, q_r\}$, each $v^{(i)}$ also does not correlate with any of the $z^{(i)}$.
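As a concrete numerical sketch of the Lemma B.1 construction invoked repeatedly here (added for illustration; the vectors below are hypothetical, and we use rectangular $U, V$ rather than the padded square matrices of the lemma):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def T_block(U, V, s, h):
    """The residual building block T_{U,V,s}(h) = V relu(U h + s)."""
    return V @ relu(U @ h + s)

d, n, rho0 = 5, 5, 0.1
alphas = [np.eye(d)[i] for i in range(n)]               # unit vectors; pairwise dist^2 = 2 >= 6*rho0
betas = [0.3 * (i + 1) * np.ones(d) for i in range(n)]  # arbitrary targets
S = [0, 1, 2]                                           # the k indices to transform

# Construction from the proof: rows of U are alpha^(i) for i in S, s = -(1 - 2 rho0) 1,
# and the i-th column of V is (beta^(i) - alpha^(i)) / (||alpha^(i)||^2 - (1 - 2 rho0)).
U = np.stack([alphas[i] for i in S])
s = -(1 - 2 * rho0) * np.ones(len(S))
V = np.stack([(betas[i] - alphas[i]) / (alphas[i] @ alphas[i] - (1 - 2 * rho0))
              for i in S], axis=1)

for i in range(n):
    out = alphas[i] + T_block(U, V, s, alphas[i])
    target = betas[i] if i in S else alphas[i]          # transformed on S, fixed elsewhere
    assert np.allclose(out, target)
```

The ReLU gates so that exactly one hidden unit fires for each $i \in S$ and none fires for $i \notin S$, which is the mechanism the inductive argument relies on.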
Then we apply Lemma B.1 and conclude that there exist $A_{j+1} = U$, $B_{j+1} = V$, $b_{j+1} = s$ such that $T_{A_{j+1}, B_{j+1}, b_{j+1}}(v^{(i)}) = 0$ for $i \le jk$, $T_{A_{j+1}, B_{j+1}, b_{j+1}}(z^{(i)}) = v^{(i)} - z^{(i)}$ for $jk < i \le (j+1)k$, and $T_{A_{j+1}, B_{j+1}, b_{j+1}}(z^{(i)}) = 0$ for $n \ge i > (j+1)k$. These imply that
\[
\begin{aligned}
h^{(i)}_{j+1} &= h^{(i)}_j + T_{A_{j+1}, B_{j+1}, b_{j+1}}(v^{(i)}) = v^{(i)} && \forall\, 1 \le i \le jk \\
h^{(i)}_{j+1} &= h^{(i)}_j + T_{A_{j+1}, B_{j+1}, b_{j+1}}(z^{(i)}) = v^{(i)} && \forall\, jk + 1 \le i \le (j+1)k \\
h^{(i)}_{j+1} &= h^{(i)}_j + T_{A_{j+1}, B_{j+1}, b_{j+1}}(z^{(i)}) = z^{(i)} && \forall\, (j+1)k < i \le n
\end{aligned}
\]
Therefore we have constructed layer $j + 1$ so that it meets the inductive hypothesis. By induction we obtain all the layers, and the last layer satisfies $h^{(i)}_\ell = v^{(i)}$ for every example $i$.

Now we are ready to prove Theorem 3.2, following the general plan sketched in Section 3.

Proof of Theorem 3.2. We formalize the intuition discussed below Theorem 3.2. First, take $k = c(\log n)/\rho^2$ for a sufficiently large absolute constant $c$ (for example, $c = 10$ works). By the Johnson–Lindenstrauss theorem ([10]; see also [17]), when $A_0$ is a random matrix with standard normal entries, with high probability all pairwise distances between the set of vectors $\{0, x^{(1)}, \dots, x^{(n)}\}$ are preserved up to a $1 \pm \rho/3$ factor. That is, for every $i$ we have $1 - \rho/3 \le \|A_0 x^{(i)}\| \le 1 + \rho/3$, and for every $i \ne j$,
\[
\|A_0 x^{(i)} - A_0 x^{(j)}\| \ge \rho(1 - \rho/3) \ge 2\rho/3.
\]
Let $z^{(i)} = A_0 x^{(i)}$ and $\rho_0 = \rho/3$. Then the $z^{(i)}$'s satisfy the conditions of Lemma B.2. We pick $r$ random vectors $q_1, \dots, q_r$ in $\mathbb{R}^k$. Let $v^{(1)}, \dots, v^{(n)}$ be defined as in equation (3.2). Then by Lemma B.2, we can construct matrices $(A_1, B_1), \dots, (A_\ell, B_\ell)$ such that
\[
h^{(i)}_\ell = v^{(i)}. \quad (B.2)
\]
Note that $v^{(i)} \in \{q_1, \dots, q_r\}$, and the $q_i$'s are random unit vectors.
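The Johnson–Lindenstrauss projection step used in this proof can be illustrated numerically (an expository sketch with hypothetical dimensions, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 200     # k = O(log n / rho^2) suffices in the theorem

# Unit-norm data points x^(1), ..., x^(n).
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Random Gaussian projection A_0 (scaled so squared norms are preserved in expectation).
A0 = rng.standard_normal((k, d)) / np.sqrt(k)
Z = X @ A0.T                # z^(i) = A_0 x^(i)

# Pairwise distances are preserved up to a small multiplicative distortion.
for i in range(5):
    for j in range(i + 1, 5):
        orig = np.linalg.norm(X[i] - X[j])
        proj = np.linalg.norm(Z[i] - Z[j])
        assert 0.8 * orig <= proj <= 1.2 * orig
```

With $k$ on the order of $\log n / \rho^2$, the distortion can be driven down to the $1 \pm \rho/3$ factor the proof requires.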
Therefore, the choice of $\alpha^{(1)} = q_1, \dots, \alpha^{(r)} = q_r$ and $\beta^{(1)} = e_1, \dots, \beta^{(r)} = e_r$ satisfies the conditions of Lemma B.1, and using Lemma B.1 we conclude that there exist $A_{\ell+1}, B_{\ell+1}, b_{\ell+1}$ such that
\[
T_{A_{\ell+1}, B_{\ell+1}, b_{\ell+1}}(q_j) = e_j - q_j \quad \text{for every } j \in \{1, \dots, r\}. \quad (B.3)
\]
By the definition of $v^{(i)}$ in equation (3.2) and equation (B.2), we conclude that
\[
\hat{y}^{(i)} = h^{(i)}_\ell + T_{A_{\ell+1}, B_{\ell+1}, b_{\ell+1}}(h^{(i)}_\ell) = y^{(i)},
\]
which completes the proof.

C Toolbox

In this section, we state two folklore linear algebra statements. The following claim should be well known, but we could not find it in the literature; we provide the proof here for completeness.

Claim C.1. Let $U \in \mathbb{R}^{d\times d}$ be a real normal matrix (that is, it satisfies $UU^\top = U^\top U$). Then there exists an orthonormal matrix $S \in \mathbb{R}^{d\times d}$ such that $U = SDS^\top$, where $D$ is a real block-diagonal matrix consisting of blocks of size at most $2\times 2$.

Proof. Since $U$ is a normal matrix, it is unitarily diagonalizable (see [16] for background). Therefore, there exist a unitary matrix $V \in \mathbb{C}^{d\times d}$ and a diagonal matrix $\Lambda \in \mathbb{C}^{d\times d}$ such that $U$ has the eigen-decomposition $U = V\Lambda V^*$. Since $U$ itself is a real matrix, its eigenvalues (the diagonal entries of $\Lambda$) come in conjugate pairs, and so do the eigenvectors (the columns of $V$). That is, we can group the columns of $V$ into pairs $(v_1, \bar{v}_1), \dots, (v_s, \bar{v}_s), v_{s+1}, \dots, v_t$, with corresponding eigenvalues $\lambda_1, \bar{\lambda}_1, \dots, \lambda_s, \bar{\lambda}_s, \lambda_{s+1}, \dots, \lambda_t$, where $\lambda_{s+1}, \dots, \lambda_t \in \mathbb{R}$. Then we get that
\[
U = \sum_{i=1}^{s} 2\,\Re(v_i \lambda_i v_i^*) + \sum_{i=s+1}^{t} v_i \lambda_i v_i^\top.
\]
Let $Q_i = \Re(v_i \lambda_i v_i^*)$; then $Q_i$ is a real matrix of rank at most 2. Let $S_i \in \mathbb{R}^{d\times 2}$ be an orthonormal basis of the column span of $Q_i$; then $Q_i$ can be written as $Q_i = S_i D_i S_i^\top$, where $D_i$ is a $2\times 2$ matrix. Finally, let $S = [S_1, \dots, S_s, v_{s+1}, \dots$
, v_t]$ and $D = \mathrm{diag}(D_1, \dots, D_s, \lambda_{s+1}, \dots, \lambda_t)$, which completes the proof.

The following claim is used in the proof of Theorem 2.2. We provide a proof here for completeness.

Claim C.2 (folklore). For any two matrices $A, B \in \mathbb{R}^{d\times d}$, we have $\|AB\|_F \ge \sigma_{\min}(A)\|B\|_F$.

Proof. Since $\sigma_{\min}(A)^2$ is the smallest eigenvalue of $A^\top A$, we have
\[
B^\top A^\top A B \succeq B^\top \cdot \sigma_{\min}(A)^2 I \cdot B.
\]
Therefore, it follows that
\[
\|AB\|_F^2 = \mathrm{tr}(B^\top A^\top A B) \ge \mathrm{tr}(B^\top \cdot \sigma_{\min}(A)^2 I \cdot B) = \sigma_{\min}(A)^2\,\mathrm{tr}(B^\top B) = \sigma_{\min}(A)^2\|B\|_F^2.
\]
Taking the square root of both sides completes the proof.
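As a quick numerical illustration of Claim C.2 (an expository sketch added here, not part of the paper), one can check the inequality on random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))

# sigma_min(A) is the smallest singular value of A.
sigma_min = np.linalg.svd(A, compute_uv=False).min()

# Claim C.2: ||AB||_F >= sigma_min(A) ||B||_F.
lhs = np.linalg.norm(A @ B, 'fro')
rhs = sigma_min * np.linalg.norm(B, 'fro')
assert lhs >= rhs - 1e-9

# The bound is tight when A is orthogonal (sigma_min = 1, Frobenius norm preserved).
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
assert np.isclose(np.linalg.norm(Q @ B, 'fro'), np.linalg.norm(B, 'fro'))
```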
