Robust Large Margin Deep Neural Networks


Authors: Jure Sokolic, Raja Giryes, Guillermo Sapiro

Jure Sokolić, Student Member, IEEE, Raja Giryes, Member, IEEE, Guillermo Sapiro, Fellow, IEEE, and Miguel R. D. Rodrigues, Senior Member, IEEE

Abstract — The generalization error of deep neural networks is studied in this work via their classification margin. Our approach is based on the Jacobian matrix of a deep neural network and can be applied to networks with arbitrary non-linearities and pooling layers, and to networks with different architectures such as feed-forward networks and residual networks. Our analysis leads to the conclusion that a bounded spectral norm of the network's Jacobian matrix in the neighbourhood of the training samples is crucial for a deep neural network of arbitrary depth and width to generalize well. This is a significant improvement over the current bounds in the literature, which imply that the generalization error grows with either the width or the depth of the network. Moreover, it shows that the recently proposed batch normalization and weight normalization re-parametrizations enjoy good generalization properties, and it leads to a novel network regularizer based on the network's Jacobian matrix. The analysis is supported with experimental results on the MNIST, CIFAR-10, LaRED and ImageNet datasets.

Index Terms — Deep learning, deep neural networks, generalization error, robustness.

J. Sokolić and M. R. D. Rodrigues are with the Department of Electronic and Electrical Engineering, University College London, London, UK (e-mail: {jure.sokolic.13, m.rodrigues}@ucl.ac.uk). R. Giryes is with the School of Electrical Engineering, Faculty of Engineering, Tel Aviv University, Tel Aviv, Israel (e-mail: raja@tauex.tau.ac.il). G. Sapiro is with the Department of Electrical and Computer Engineering, Duke University, NC, USA (e-mail: guillermo.sapiro@duke.edu). The work of Jure Sokolić and Miguel R. D. Rodrigues was supported in part by EPSRC under grant EP/K033166/1. The work of Raja Giryes was supported in part by GIF, the German-Israeli Foundation for Scientific Research and Development. The work of Guillermo Sapiro was supported in part by NSF, ONR, ARO, and NGA.

May 24, 2017 DRAFT

I. INTRODUCTION

In recent years, deep neural networks (DNNs) have achieved state-of-the-art results in image recognition, speech recognition and many other applications [1]–[4]. DNNs are constructed as a series of non-linear signal transformations that are applied sequentially, where the parameters of each layer are estimated from the data [3]. Typically, each layer applies to its input a linear (or affine) transformation followed by a point-wise non-linearity such as the sigmoid function, the hyperbolic tangent function or the Rectified Linear Unit (ReLU) [5]. Many DNNs also include pooling layers, which act as down-sampling operators and may also provide invariance to various input transformations such as translation [6], [7]. They may be linear, as in average pooling, or non-linear, as in max-pooling.

There have been various attempts to provide a theoretical foundation for the representation power, optimization and generalization of DNNs. For example, the works in [8], [9] showed that neural networks with a single hidden layer – shallow networks – can approximate any measurable Borel function. On the other hand, it was shown in [10] that a deep network can divide the space into an exponential number of sets, which cannot be achieved by shallow networks that use the same number of parameters. Similarly, the authors in [11] conclude that functions implemented by DNNs are exponentially more expressive than functions implemented by shallow networks.
The work in [12] shows that for a given number of parameters and a given depth, there always exists a DNN that can be approximated by a shallower network only if the number of parameters in the shallow network is exponential in the number of layers of the deep network. The scattering transform – a convolutional-DNN-like transform based on the wavelet transform and point-wise non-linearities – provides insights into the translation invariance and stability to deformations of convolutional DNNs [13]–[15]. DNNs with random weights are studied in [16], where it is shown that such networks perform distance-preserving embeddings of low-dimensional data manifolds. The authors in [17] model the loss function of a DNN with a spin-glass model and show that for large networks the local optima of the loss function are close to the global optima. Optimization aspects of DNNs are studied from the perspective of tensor factorization in [18], where it is shown that if a network is large, then it is possible to find the global minimum from any initialization with a gradient descent algorithm. The role of DNNs in improving the convergence speed of various iterative algorithms is studied in [19]. The optimization dynamics of a deep linear network are studied in [20], where it is shown that the learning speed of deep networks may be independent of their depth. Reparametrization of DNNs for more efficient learning is studied in depth in [21]. A modified version of stochastic gradient descent for the optimization of DNNs that is invariant to weight rescaling in different layers is proposed in [22], where it is shown that such an optimization may lead to a smaller generalization error (GE) – the difference between the empirical error and the expected error – than the one achieved with classical stochastic gradient descent. The authors in [23] propose batch normalization – a technique that normalizes the output of each layer and leads to faster training and also a smaller GE.
A similar technique based on normalization of the rows of the weight matrix is proposed in [24]. It is shown empirically that such reparametrizations lead to faster training and a smaller GE. Learning of DNNs by bounding the spectral norm of the weight matrices is proposed in [25]. Other methods for DNN regularization include weight decay, dropout [26], constraining the Jacobian matrix of the encoder for the regularization of auto-encoders [27], and enforcing a DNN to be a partial isometry [28].

An important theoretical aspect of DNNs is the effect of their architecture, e.g. depth and width, on their GE. Various measures such as the VC-dimension [29], [30], the Rademacher or Gaussian complexities [31] and algorithmic robustness [32] have been used to bound the GE in the context of DNNs. For example, the VC-dimension of a DNN with the hard-threshold non-linearity is equal to the number of parameters in the network, which implies that the sample complexity is linear in the number of parameters of the network. The GE can also be bounded independently of the number of parameters, provided that the norms of the weight matrices (the network's linear components) are constrained appropriately. Such constraints are usually enforced by training networks with weight decay regularization, which is simply the \ell_1- or \ell_2-norm of all the weights in the network. For example, the work [33] studies the GE of DNNs with ReLUs and constraints on the norms of the weight matrices. However, it provides GE bounds that scale exponentially with the network depth. Similar behaviour is also depicted in [34]. The authors in [32] show that DNNs are robust provided that the \ell_1-norm of the weights in each layer is bounded. The bounds are exponential in the \ell_1-norm of the weights if the norm is greater than 1. The GE bounds in [30], [32], [33] suggest that the GE of a DNN is bounded only if the number of training samples grows with the DNN depth or size.
However, in practice increasing a network's depth or size often leads to a lower GE [4], [35]. Moreover, recent work in [36] shows that a 2-layer DNN with ReLUs may fit any function of n samples in d dimensions provided that it has 2n + d parameters, which is often the case in practice. They show that the nature of the GE depends more on the nature of the data than on the architecture of the network, as the same network is able to fit both structured data and random data, where for the former the GE is very low and for the latter it is very large. The authors conclude that data-agnostic measures such as the Rademacher complexity or the VC-dimension are not adequate to explain the good generalization properties of modern DNNs. Our work complements the previous works on the GE of DNNs by bounding the GE in terms of the DNN classification margin, which is independent of the DNN depth and size, but takes into account the structure of the data (considering its covering number) and therefore avoids the issues presented above. The extension of our results to invariant DNNs is provided in [37].

A. Contributions

In this work we focus on the GE of a multi-class DNN classifier with general non-linearities. We establish new GE bounds for DNN classifiers via their classification margin, i.e. the distance between a training sample and the non-linear decision boundary induced by the DNN classifier in the sample space. The work capitalizes on the algorithmic robustness framework in [32] to cast insight onto the generalization properties of DNNs. In particular, the use of this framework to understand the operation of DNNs involves various innovations, which include:

• We derive bounds for the GE of DNNs by lower bounding their classification margin. The lower bound of the classification margin is expressed as a function of the network's Jacobian matrix.

• Our approach includes a large class of DNNs.
For example, we consider DNNs with the softmax layer at the network output; DNNs with various non-linearities such as the Rectified Linear Unit (ReLU), the sigmoid and the hyperbolic tangent; DNNs with pooling, such as down-sampling, average pooling and max-pooling; and networks with shortcut connections such as Residual Networks [4].

• Our analysis shows that the GE of a DNN can be bounded independently of its depth or width provided that the spectral norm of the Jacobian matrix in the neighbourhood of the training samples is bounded. We argue that this result gives a justification for the low GE of DNNs in practice. Moreover, it also provides an explanation for why training with the recently proposed weight normalization or batch normalization can lead to a small GE. In such networks the \ell_2-norm of the weight matrices is fixed and \ell_2-norm regularization does not apply. The analysis also leads to a novel Jacobian matrix-based regularizer, which can be applied to weight-normalized or batch-normalized networks.

• We provide a series of examples on the MNIST, CIFAR-10, LaRED and ImageNet datasets that validate our analysis and demonstrate the effectiveness of the Jacobian regularizer.

Our contributions differ from the existing works in many ways. In particular, the GE of DNNs has been studied via the algorithmic robustness framework in [32]. Their bounds are based on the per-unit \ell_1-norm of the weight matrices, and the studied loss is not relevant for classification. Our analysis is much broader, as it aims at bounding the GE of the 0-1 loss directly and also considers DNNs with pooling. Moreover, our bounds are a function of the network's Jacobian matrix and are tighter than the bounds based on the norms of the weight matrices. The work in [28] shows that learning transformations that are locally isometric is robust and leads to a small GE.
Though they apply the proposed technique to DNNs, they do not show how the DNN architecture affects the GE as our work does. The authors in [25] have observed that contractive DNNs with ReLUs trained with the hinge loss lead to a large classification margin. However, they do not provide any GE bounds. Moreover, their results are limited to DNNs with ReLUs, whereas our analysis holds for arbitrary non-linearities, DNNs with pooling and DNNs with the softmax layer. The work in [27] is related to ours in the sense that it proposes to regularize auto-encoders by constraining the Frobenius norm of the encoder's Jacobian matrix. However, their work is more empirical and is less concerned with the classification margin or GE bounds. They use the Jacobian matrix to regularize the encoder, whereas we use the Jacobian matrix to regularize the entire DNN. Finally, our DNN analysis, which is based on the network's Jacobian matrix, is also related to the concept of sensitivity analysis that has been applied to feature selection for SVMs and neural networks [38], [39], and to the construction of radial basis function networks [40], since the spectral norm of the Jacobian matrix quantifies the sensitivity of the DNN output with respect to input perturbations.

B. Paper organization

Section II introduces the problem of generalization error, including elements of the algorithmic robustness framework, and introduces DNN classifiers. Properties of DNNs are described in Section III. The bounds on the classification margin of DNNs and their implications for the GE of DNNs are discussed in Section IV. Generalizations of our results are discussed in Section V. Section VI presents experimental results. The paper is concluded in Section VII. The proofs are deferred to the Appendix.

C. Notation

We use the following notation in the sequel: matrices, column vectors, scalars and sets are denoted by boldface upper-case letters (X), boldface lower-case letters (x), italic letters (x) and calligraphic upper-case letters (\mathcal{X}), respectively. The convex hull of \mathcal{X} is denoted by conv(\mathcal{X}). I_N \in R^{N \times N} denotes the identity matrix, 0_{M \times N} \in R^{M \times N} denotes the zero matrix and 1_N \in R^N denotes the vector of ones. The subscripts are omitted when the dimensions are clear from the context. e_k denotes the k-th vector of the standard basis in R^N. \|x\|_2 denotes the Euclidean norm of x, \|X\|_2 denotes the spectral norm of X, and \|X\|_F denotes the Frobenius norm of X. The i-th element of the vector x is denoted by (x)_i, and the element in the i-th row and j-th column of X is denoted by (X)_{ij}. The covering number of \mathcal{X} with d-metric balls of radius \rho is denoted by N(\mathcal{X}; d, \rho).

II. PROBLEM STATEMENT

We start by describing the GE in the framework of statistical learning. Then we dwell on the GE bounds based on the robustness framework by Xu and Mannor [32]. Finally, we present the DNN architectures studied in this paper.

A. The Classification Problem and Its GE

We consider a classification problem, where we observe a vector x \in \mathcal{X} \subseteq R^N that has a corresponding class label y \in \mathcal{Y}. The set \mathcal{X} is called the input space, \mathcal{Y} = \{1, 2, \ldots, N_{\mathcal{Y}}\} is called the label space and N_{\mathcal{Y}} denotes the number of classes. The sample space is denoted by \mathcal{S} = \mathcal{X} \times \mathcal{Y} and an element of \mathcal{S} is denoted by s = (x, y). We assume that samples from \mathcal{S} are drawn according to a probability distribution P defined on \mathcal{S}. A training set of m samples drawn from P is denoted by S_m = \{s_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m. The goal of learning is to leverage the training set S_m to find a classifier g(x) that provides a label estimate \hat{y} given the input vector x.
In this work the classifier is a DNN, which is described in detail in Section II-C. The quality of the classifier output is measured by the loss function \ell(g(x), y), which measures the discrepancy between the true label y and the estimated label \hat{y} = g(x) provided by the classifier. Here we take the loss to be the 0-1 indicator function. Other losses such as the hinge loss or the categorical cross-entropy loss are possible. The empirical loss of the classifier g(x) associated with the training set and the expected loss of the classifier g(x) are defined as

\ell_{emp}(g) = \frac{1}{m} \sum_{s_i \in S_m} \ell(g(x_i), y_i)   (1)

and

\ell_{exp}(g) = E_{s \sim P}[\ell(g(x), y)],   (2)

respectively. An important question, which occupies us throughout this work, is how well \ell_{emp}(g) predicts \ell_{exp}(g). The measure we use for quantifying the prediction quality is the difference between \ell_{exp}(g) and \ell_{emp}(g), which is called the generalization error:

GE(g) = |\ell_{exp}(g) - \ell_{emp}(g)|.   (3)

B. The Algorithmic Robustness Framework

In order to provide bounds on the GE of DNN classifiers we leverage the robustness framework [32], which is described next. The algorithmic robustness framework provides bounds for the GE based on the robustness of a learning algorithm that learns a classifier g leveraging the training set S_m:

Definition 1 ([32]). Let S_m be a training set and \mathcal{S} the sample space. A learning algorithm is (K, \epsilon(S_m))-robust if the sample space \mathcal{S} can be partitioned into K disjoint sets denoted by \mathcal{K}_k, k = 1, \ldots, K,

\mathcal{K}_k \subseteq \mathcal{S}, \quad k = 1, \ldots, K,   (4)

\mathcal{S} = \cup_{k=1}^{K} \mathcal{K}_k,   (5)

\mathcal{K}_k \cap \mathcal{K}_{k'} = \emptyset, \quad \forall k \neq k',   (6)

such that for all s_i \in S_m and all s \in \mathcal{S},

s_i = (x_i, y_i) \in \mathcal{K}_k \wedge s = (x, y) \in \mathcal{K}_k \implies |\ell(g(x_i), y_i) - \ell(g(x), y)| \leq \epsilon(S_m).   (7)

Note that s_i is an element of the training set and s is an arbitrary element of the sample space \mathcal{S}.
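The losses in (1)-(3) are straightforward to compute in practice. The following is a minimal numerical sketch with a hypothetical stand-in classifier (a simple threshold rule, not the DNN of Section II-C), where \ell_{exp} is approximated by Monte-Carlo sampling from a toy distribution P:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m labelled toy samples: class y in {0, 1}, x ~ N((2y - 1, 0), I)."""
    y = rng.integers(0, 2, size=m)
    x = rng.normal(size=(m, 2)) + np.stack([2.0 * y - 1.0, np.zeros(m)], axis=1)
    return x, y

def g(x):
    """Hypothetical stand-in classifier: the sign of the first coordinate."""
    return int(x[0] > 0)

def zero_one_empirical_loss(g, X, y):
    """Empirical 0-1 loss l_emp(g) of Eq. (1): fraction of misclassified samples."""
    return float(np.mean([g(xi) != yi for xi, yi in zip(X, y)]))

X_tr, y_tr = sample(200)
l_emp = zero_one_empirical_loss(g, X_tr, y_tr)

# Monte-Carlo estimate of the expected loss l_exp(g) of Eq. (2), and the
# generalization error GE(g) = |l_exp - l_emp| of Eq. (3).
X_te, y_te = sample(100_000)
l_exp = zero_one_empirical_loss(g, X_te, y_te)
ge = abs(l_exp - l_emp)
```

All distributions and parameters above are arbitrary illustrations; the point is only that GE(g) contrasts the training-set loss with an estimate of the population loss.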
Therefore, a robust learning algorithm chooses a classifier g for which the losses of any s and s_i in the same partition \mathcal{K}_k are close. The following theorem provides the GE bound for robust algorithms.¹

Theorem 1 (Theorem 3 in [32]). If a learning algorithm is (K, \epsilon(S_m))-robust and \ell(g(x), y) \leq M for all s = (x, y) \in \mathcal{S}, then for any \delta > 0, with probability at least 1 - \delta,

GE(g) \leq \epsilon(S_m) + M \sqrt{\frac{2K \log(2) + 2\log(1/\delta)}{m}}.   (8)

¹ Additional variants of this theorem are provided in [32].

The first term in the GE bound in (8) is constant and depends on the training set S_m. The second term behaves as O(1/\sqrt{m}) and vanishes as the size of the training set S_m approaches infinity. M = 1 in the case of the 0-1 loss, and K corresponds to the number of partitions of the sample space \mathcal{S}. A bound on the number of partitions K can be obtained via the covering number of the sample space \mathcal{S}. The covering number is the smallest number of (pseudo-)metric balls of radius \rho needed to cover \mathcal{S}, and it is denoted by N(\mathcal{S}; d, \rho), where d denotes the (pseudo-)metric.² The space \mathcal{S} is the Cartesian product of a continuous input space \mathcal{X} and a discrete label space \mathcal{Y}, and we can write N(\mathcal{S}; d, \rho) \leq N_{\mathcal{Y}} \cdot N(\mathcal{X}; d, \rho), where N_{\mathcal{Y}} corresponds to the number of classes. The choice of metric d determines how efficiently one may cover \mathcal{X}. A common choice is the Euclidean metric

d(x, x') = \|x - x'\|_2, \quad x, x' \in \mathcal{X},   (9)

which we also use in this paper.

² Note that we can always obtain a set of disjoint partitions from the set of metric balls used to construct the covering.

The covering number of many structured low-dimensional data models can be bounded in terms of their "intrinsic" properties, for example:
• a Gaussian mixture model (GMM) with L Gaussians and covariance matrices of rank at most k leads to a covering number N(\mathcal{X}; d, \rho) = L(1 + 2/\rho)^k [41];

• k-sparse signals in a dictionary with L atoms have a covering number N(\mathcal{X}; d, \rho) = \binom{L}{k}(1 + 2/\rho)^k [16];

• a C_M regular k-dimensional manifold, where C_M is a constant that captures its "intrinsic" properties, has a covering number N(\mathcal{X}; d, \rho) = (C_M/\rho)^k [42].

1) Large Margin Classifier: An example of a robust learning algorithm is the large margin classifier, which we consider in this work. The classification margin is defined as follows:

Definition 2 (Classification margin). The classification margin of a training sample s_i = (x_i, y_i) measured by a metric d is defined as

\gamma^d(s_i) = \sup\{a : d(x_i, x) \leq a \implies g(x) = y_i \; \forall x\}.   (10)

The classification margin of a training sample s_i is the radius of the largest metric ball (induced by d) in \mathcal{X} centered at x_i that is contained in the decision region associated with the class label y_i. The robustness of large margin classifiers is given by the following theorem.

Theorem 2 (Adapted from Example 9 in [32]). If there exists \gamma such that

\gamma^d(s_i) > \gamma > 0 \quad \forall s_i \in S_m,   (11)

then the classifier g(x) is (N_{\mathcal{Y}} \cdot N(\mathcal{X}; d, \gamma/2), 0)-robust.

Theorems 1 and 2 imply that the GE of a classifier with margin \gamma is upper bounded by (neglecting the \log(1/\delta) term in (8))

GE(g) \lesssim \frac{1}{\sqrt{m}} \sqrt{2 \log(2) \cdot N_{\mathcal{Y}} \cdot N(\mathcal{X}; d, \gamma/2)}.   (12)

Note that in the case of a large margin classifier the constant \epsilon(S_m) in (8) is equal to 0, and the GE approaches zero at a rate 1/\sqrt{m} as the number of training samples grows. The GE also increases sub-linearly with the number of classes N_{\mathcal{Y}}. Finally, the GE depends on the complexity of the input space \mathcal{X} and the classification margin via the covering number N(\mathcal{X}; d, \gamma/2).
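The three covering-number expressions listed above are simple closed forms; a short sketch evaluates them (the parameters L, k, \rho and C_M below are arbitrary example values):

```python
import math

def cover_gmm(L, k, rho):
    """Covering number of an L-component GMM with rank-k covariances: L (1 + 2/rho)^k."""
    return L * (1 + 2 / rho) ** k

def cover_sparse(L, k, rho):
    """k-sparse signals in a dictionary of L atoms: C(L, k) (1 + 2/rho)^k."""
    return math.comb(L, k) * (1 + 2 / rho) ** k

def cover_manifold(C_M, k, rho):
    """C_M-regular k-dimensional manifold: (C_M / rho)^k."""
    return (C_M / rho) ** k

rho = 0.5  # ball radius; smaller radii require more balls
n_gmm = cover_gmm(L=10, k=4, rho=rho)
n_sparse = cover_sparse(L=10, k=4, rho=rho)
n_manifold = cover_manifold(C_M=5.0, k=4, rho=rho)
```

Note that all three grow exponentially in the intrinsic dimension k but are independent of the ambient dimension N, which is what makes the bound (12) data-dependent rather than architecture-dependent.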
For example, if we take \mathcal{X} to be a C_M regular k-dimensional manifold, then the upper bound on the GE behaves as follows.

[Fig. 1. A DNN transforms the input vector x to the feature vector z by a series of (non-linear) transforms \phi_1(x, \theta_1), \phi_2(z_1, \theta_2), \ldots, \phi_L(z_{L-1}, \theta_L).]

Corollary 1. Assume that \mathcal{X} is a (subset of a) C_M regular k-dimensional manifold, where N(\mathcal{X}; d, \rho) \leq (C_M/\rho)^k. Assume also that the classifier g(x) achieves a classification margin \gamma and take \ell(g(x_i), y_i) to be the 0-1 loss. Then for any \delta > 0, with probability at least 1 - \delta,

GE(g) \leq \sqrt{\frac{\log(2) \cdot N_{\mathcal{Y}} \cdot 2^{k+1} \cdot (C_M)^k}{\gamma^k m}} + \sqrt{\frac{2\log(1/\delta)}{m}}.   (13)

Proof: The proof follows directly from Theorems 1 and 2.

Note that the role of the classifier is captured via the achieved classification margin \gamma. If we can always ensure a classification margin \gamma = 1, then the GE bound only depends on the dimension of the manifold k and the manifold constant C_M. We relate this bound, in the context of DNNs, to other bounds in the literature in Section IV.

C. Deep Neural Network Classifier

The DNN classifier is defined as

g(x) = \arg\max_{i \in [N_{\mathcal{Y}}]} (f(x))_i,   (14)

where (f(x))_i is the i-th element of the N_{\mathcal{Y}}-dimensional output of a DNN f : R^N \to R^{N_{\mathcal{Y}}}. We assume that f(x) is composed of L layers:

f(x) = \phi_L(\phi_{L-1}(\cdots \phi_1(x, \theta_1), \cdots \theta_{L-1}), \theta_L),   (15)

where \phi_l(\cdot, \theta_l) represents the l-th layer with parameters \theta_l, l = 1, \ldots, L. The output of the l-th layer is denoted by z_l, i.e. z_l = \phi_l(z_{l-1}, \theta_l), z_l \in R^{M_l}; the input layer corresponds to z_0 = x; and the output of the last layer is denoted by z = f(x). Such a DNN is visualized in Fig. 1. Next, we define the various layers \phi_l(\cdot, \theta_l) that are used in modern state-of-the-art DNNs.
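The layer composition in (15) and the classification rule in (14) can be sketched directly; the network below uses random placeholder weights and ReLU layers purely for illustration, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_layer(W, b):
    """One layer phi_l(z, theta_l) = sigma(W z + b) with sigma = ReLU."""
    return lambda z: np.maximum(W @ z + b, 0.0)

# A small L = 3 network f: R^4 -> R^{N_Y} with N_Y = 3 classes.
sizes = [4, 8, 6, 3]
layers = [relu_layer(rng.normal(size=(m_out, m_in)), rng.normal(size=m_out))
          for m_in, m_out in zip(sizes[:-1], sizes[1:])]

def f(x):
    """f(x) = phi_L(phi_{L-1}(... phi_1(x, theta_1) ...), theta_L), Eq. (15)."""
    z = x
    for phi in layers:
        z = phi(z)
    return z

def g(x):
    """Classifier of Eq. (14): index of the largest output coordinate."""
    return int(np.argmax(f(x)))

x = rng.normal(size=4)
label = g(x)
```

In a real network the last layer would typically be linear or softmax, as described next; the sketch only shows the sequential composition of layers.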
1) Linear and Softmax Layers: We start by describing the last layer of a DNN, which maps the output of the previous layer into R^{N_{\mathcal{Y}}}, where N_{\mathcal{Y}} corresponds to the number of classes.³ This layer can be linear:

z = \hat{z}, \quad \hat{z} = W_L z_{L-1} + b_L,   (16)

where W_L \in R^{N_{\mathcal{Y}} \times M_{L-1}} is the weight matrix associated with the last layer and b_L \in R^{N_{\mathcal{Y}}} is the bias vector associated with the last layer. Note that according to (14), the i-th row of W_L can be interpreted as a normal to the hyperplane that separates class i from the others. If the last layer is linear, the usual choice of learning objective is the hinge loss. A more common choice for the last layer is the softmax layer:

z = \zeta(\hat{z}) = e^{\hat{z}} / (1^T e^{\hat{z}}), \quad \hat{z} = W_L z_{L-1} + b_L,   (17)

where \zeta(\cdot) is the softmax function and W_L and b_L are the same as in (16). Note that the exponential is applied element-wise. The elements of z are in the range (0, 1) and are often interpreted as "probabilities" associated with the corresponding class labels. The decision boundary between class y_1 and class y_2 corresponds to the hyperplane \{z : (z)_{y_1} = (z)_{y_2}\}. The softmax layer is usually coupled with the categorical cross-entropy training objective. For the remainder of this work we take the softmax layer to be the last layer of the DNN, but note that all results still apply if the linear layer is used.

³ Assuming that there are N_{\mathcal{Y}} one-vs.-all classifiers.

TABLE I
POINT-WISE NON-LINEARITIES

Name                | Function \sigma(x)                            | Derivative \frac{d}{dx}\sigma(x)                        | Derivative bound \sup_x |\frac{d}{dx}\sigma(x)|
ReLU                | \max(x, 0)                                    | 1 if x > 0; 0 if x \leq 0                               | \leq 1
Sigmoid             | \frac{1}{1 + e^{-x}}                          | \sigma(x)(1 - \sigma(x)) = \frac{e^{-x}}{(1+e^{-x})^2}  | \leq 1/4
Hyperbolic tangent  | \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}  | 1 - \sigma(x)^2                                         | \leq 1
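A minimal sketch of the softmax function in (17), together with a grid check of the derivative bounds in Table I (the pre-activation vector below is an arbitrary example):

```python
import numpy as np

def softmax(z_hat):
    """Softmax function zeta(z_hat) = exp(z_hat) / (1^T exp(z_hat)) of Eq. (17)."""
    e = np.exp(z_hat - np.max(z_hat))  # shift for numerical stability
    return e / e.sum()

z_hat = np.array([2.0, -1.0, 0.5])    # hypothetical last-layer pre-activation
z = softmax(z_hat)                    # entries in (0, 1), summing to 1

# Derivative bounds from Table I, checked on a dense grid.
xs = np.linspace(-10.0, 10.0, 2001)
sigmoid = 1.0 / (1.0 + np.exp(-xs))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # bounded by 1/4
d_tanh = 1.0 - np.tanh(xs) ** 2         # bounded by 1
d_relu = (xs > 0).astype(float)         # bounded by 1
```

Since the softmax is monotone in each coordinate, it preserves the argmax of \hat{z}, which is why the classifier (14) gives the same decision with either the linear or the softmax last layer.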
2) Non-linear Layers: A non-linear layer is defined as

z_l = [\hat{z}_l]_\sigma = [W_l z_{l-1} + b_l]_\sigma,   (18)

where [\hat{z}_l]_\sigma represents the element-wise non-linearity \sigma applied to each element of \hat{z}_l \in R^{M_l}, and \hat{z}_l = W_l z_{l-1} + b_l represents the linear transformation of the layer input. W_l \in R^{M_l \times M_{l-1}} is the weight matrix and b_l \in R^{M_l} is the bias vector. The typical non-linearities are the ReLU, the sigmoid and the hyperbolic tangent; they are listed in Table I. The choice of non-linearity \sigma is usually the same for all the layers in the network. Note that the non-linear layer in (18) includes the convolutional layers used in convolutional neural networks; in that case the weight matrix is block-cyclic.

[Fig. 2. Decision boundaries in the input space and in the output space. Plot (a) shows samples of classes 1 and 2, the decision regions produced by a two-layer network projected into the input space, and the margin \gamma^d(s_i). Plot (b) shows the samples transformed by the network and the corresponding decision boundary at the network output.]

3) Pooling Layers: A pooling layer reduces the dimension of the intermediate representation and is defined as

z_l = P_l(z_{l-1}) z_{l-1},   (19)

where P_l(z_{l-1}) is the pooling matrix. The usual choices of pooling are down-sampling, max-pooling and average pooling. We denote by p_i^l(z_{l-1}) the i-th row of P_l(z_{l-1}) and assume that there are M_l pooling regions \mathcal{P}_i, i = 1, \ldots, M_l. In the case of down-sampling, p_i^l(z_{l-1}) = e_{\mathcal{P}_i(1)}, where \mathcal{P}_i(1) is the first element of the pooling region \mathcal{P}_i; in the case of max-pooling, p_i^l(z_{l-1}) = e_{j^\star}, where j^\star = \arg\max_{j' \in \mathcal{P}_i} |(z_{l-1})_{j'}|; and in the case of average pooling, p_i^l(z_{l-1}) = \frac{1}{|\mathcal{P}_i|} \sum_{j \in \mathcal{P}_i} e_j.

III. THE GEOMETRICAL PROPERTIES OF DEEP NEURAL NETWORKS

The classification margin introduced in Section II-A is a function of the decision boundary in the input space. This is visualized in Fig. 2(a). However, a training algorithm usually optimizes the decision boundary at the network output (Fig. 2(b)), which does not necessarily imply a large classification margin. In this section we introduce a general approach that allows us to bound the expansion of distances between the network input and its output. In Section IV we use this to establish bounds on the classification margin and GE bounds that are independent of the network depth or width.

We start by defining the Jacobian matrix (JM) of the DNN f(x):

J(x) = \frac{d f(x)}{d x} = \prod_{l=2}^{L} \frac{d\phi_l(z_{l-1})}{d z_{l-1}} \cdot \frac{d\phi_1(x)}{d x}.   (20)

Note that by the properties of the chain rule, the JM is computed as the product of the JMs of the individual network layers, evaluated at the appropriate values of the layer inputs x, z_1, \ldots, z_{L-1}. We use the JM to establish a relation between a pair of vectors in the input space and the output space.

Theorem 3. For any x, x' \in \mathcal{X} and a DNN f(\cdot), we have

f(x') - f(x) = \int_0^1 J(x + t(x' - x)) \, dt \, (x' - x)   (21)

= J_{x,x'} (x' - x),   (22)

where

J_{x,x'} = \int_0^1 J(x + t(x' - x)) \, dt   (23)

is the average Jacobian on the line segment between x and x'.

Proof: The proof appears in Appendix A.

As a direct consequence of Theorem 3 we can bound the distance expansion between x and x' at the output of the network f(\cdot):

Corollary 2. For any x, x' \in \mathcal{X} and a DNN f(\cdot), we have

\|f(x') - f(x)\|_2 = \|J_{x,x'}(x' - x)\|_2 \leq \sup_{x'' \in conv(\mathcal{X})} \|J(x'')\|_2 \, \|x' - x\|_2.   (24)

Proof: The proof appears in Appendix B.
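Theorem 3 and Corollary 2 can be checked numerically on a small two-layer tanh network (random placeholder weights, not the paper's experiments); the average Jacobian in (23) is approximated by trapezoidal quadrature along the segment:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def f(x):
    """Stand-in two-layer network f(x) = W2 tanh(W1 x + b1) + b2."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def jacobian(x):
    """Chain rule of Eq. (20): J(x) = W2 diag(1 - tanh(W1 x + b1)^2) W1."""
    d = 1.0 - np.tanh(W1 @ x + b1) ** 2
    return W2 @ (d[:, None] * W1)

x, xp = rng.normal(size=3), rng.normal(size=3)

# Average Jacobian of Eq. (23): trapezoidal quadrature on a fine t-grid.
ts = np.linspace(0.0, 1.0, 2001)
Js = np.stack([jacobian(x + t * (xp - x)) for t in ts])
J_avg = (Js[:-1] + Js[1:]).sum(axis=0) / (2.0 * (len(ts) - 1))

lhs = f(xp) - f(x)        # left-hand side of Eq. (21)
rhs = J_avg @ (xp - x)    # J_{x,x'} (x' - x), Eq. (22)

# Corollary 2: the distance expansion is bounded by the largest spectral
# norm of J along the segment (a proxy for the supremum over conv(X)).
spec_sup = max(np.linalg.norm(J, 2) for J in Js)
```

The same example also illustrates the layer-wise bound of the next subsection: since the tanh derivative is at most 1, every J(x) here satisfies \|J(x)\|_2 \leq \|W_2\|_2 \|W_1\|_2.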
Note that we have established that J_{x,x'} corresponds to a linear operator that maps the vector x' - x to the vector f(x') - f(x). This implies that the maximum distance expansion of the network f(x) is bounded by the maximum spectral norm of the network's JM. Moreover, the JM of f(x) corresponds to the product of the JMs of all the layers of f(x), as shown in (20). It is possible to calculate the JMs of all the layers defined in Section II-C:

1) Jacobian Matrix of Linear and Softmax Layers: The JM of the linear layer defined in (16) is equal to the weight matrix:

\frac{d z}{d z_{L-1}} = W_L.   (25)

Similarly, in the case of the softmax layer defined in (17) the JM is

\frac{d z}{d z_{L-1}} = \frac{d z}{d \hat{z}} \cdot \frac{d \hat{z}}{d z_{L-1}} = \left(-\zeta(\hat{z})\zeta(\hat{z})^T + \mathrm{diag}(\zeta(\hat{z}))\right) \cdot W_L.   (26)

Note that -\zeta(\hat{z})\zeta(\hat{z})^T + \mathrm{diag}(\zeta(\hat{z})) corresponds to the JM of the softmax function \zeta(\hat{z}).

2) Jacobian Matrix of Non-Linear Layers: The JM of the non-linear layer (18) can be derived in the same way as the JM of the softmax layer. We first define the JM of the point-wise non-linearity, which is a diagonal matrix:⁴

\left(\frac{d z_l}{d \hat{z}_l}\right)_{ii} = \frac{d\sigma((\hat{z}_l)_i)}{d (\hat{z}_l)_i}, \quad i = 1, \ldots, M_l.   (27)

The derivatives associated with the various non-linearities are provided in Table I. The JM of the non-linear layer can then be expressed as

\frac{d z_l}{d z_{l-1}} = \frac{d z_l}{d \hat{z}_l} \cdot W_l.   (28)

3) Jacobian Matrix of Pooling Layers: The pooling operator defined in (19) is a linear or piece-wise linear operator. The corresponding JM is therefore also linear or piece-wise linear and is equal to

P_l(z_{l-1}).   (29)

The following lemma collects the bounds on the spectral norms of the JMs of all the layers defined in Section II-C.

Lemma 1.
The following statements hold: 1) The spectral norm of JMs of the linear layer in (16) , the softmax layer in (17) and non-linear layer in (18) with the ReLU , Sigmoid or Hyperbolic tangent non-linearities is upper bounded by     d z l d ˆ z l − 1     2 ≤ k W l k 2 ≤ k W l k F . (30) 2) Assume that the pooling re gions of the down-sampling, max-pooling and averag e pooling operators ar e non-overlapping. Then the spectr al norm of their JMs can be upper bounded by     d z l d ˆ z l − 1     2 ≤ 1 . (31) Pr oof: The proof appears in Appendix C. Lemma 1 sho ws that the spectral norms of all layers can be bounded in terms of their weight matrices. As a consequence, the spectral norm of the JM is bounded by the product of the spectral norms of the weight matrices. W e le verage this facts to pro vide GE bounds in the ne xt section. 4 Note that in case of ReLU the deri vativ e of max ( x, 0) is not defined for x = 0 , and we need to use subderi vati ves (or subgradients) to define the JM. W e av oid this technical complication and simply take the deriv ative of max ( x, 0) to be 0 when x = 0 . Note that this does not change the results in any way because the subset of X for which the deriv ativ es are not defined has zero measure. May 24, 2017 DRAFT 14 W e also briefly explore a relationship between the Jacobian matrix and the Fisher information matrix. T o simplify the deriv ations we assume M = 1 , N = 1 , x 0 = x + θ n and n ∼ N (0 , 1) , where θ is the model parameter and x is deterministic. The Fisher information F ( θ ) measures how much information about the parameter θ is contained in the random variable y = f ( x 0 ) , where f represents a DNN. In this particular case the Fisher information is giv en as F ( θ ) = E n "  d log f ( x 0 ) dθ  2 # = E n "  d log f ( x 0 ) d f ( x 0 ) d f ( x 0 ) d x 0 d x 0 θ  2 # = E n "  d log f ( x 0 ) d f ( x 0 ) J ( x 0 ) n  2 # . 
In our setup the parameter $\theta$ can be interpreted as the magnitude of the input perturbation. It is clear from (32) that a small norm of the Jacobian matrix leads to a small Fisher information, which indicates that the distribution of $y$ is not very informative about the parameter $\theta$. By ensuring that the norm of the Jacobian is small we therefore naturally endow the network with robustness against perturbations of the input.

IV. GENERALIZATION ERROR OF A DEEP NEURAL NETWORK CLASSIFIER

In this section we provide the classification margin bounds for DNN classifiers that allow us to bound the GE. We follow the common practice and assume that the networks are trained with a loss that promotes separation of different classes at the network output, e.g. the categorical cross-entropy loss or the hinge loss. In other words, the training aims at maximizing the score of each training sample, where the score is defined as follows.

Definition 3 (Score). The score of a training sample $s_i = (\mathbf{x}_i, y_i)$ is

$$o(s_i) = \min_{j \neq y_i} \sqrt{2}\, (\boldsymbol{\delta}_{y_i} - \boldsymbol{\delta}_j)^T f(\mathbf{x}_i), \quad (33)$$

where $\boldsymbol{\delta}_i \in \mathbb{R}^{N_\mathcal{Y}}$ is the Kronecker delta vector with $(\boldsymbol{\delta}_i)_i = 1$.

Recall the definition of the classifier $g(\mathbf{x})$ in (14) and note that the decision boundary between class $i$ and class $j$ in the feature space $\mathcal{Z}$ is given by the hyperplane $\{\mathbf{z} : (\mathbf{z})_i = (\mathbf{z})_j\}$. A positive score indicates that at the network output the classes are separated by a margin that corresponds to the score. However, a large score $o(s_i)$ does not necessarily imply a large classification margin $\gamma^d(s_i)$. Theorem 4 provides classification margin bounds expressed as a function of the score and the properties of the network.

Theorem 4. Assume that a DNN classifier $g(\mathbf{x})$, as defined in (14), classifies a training sample $\mathbf{x}_i$ with the score $o(s_i) > 0$.
Then the classification margin can be bounded as

$$\gamma^d(s_i) \geq \frac{o(s_i)}{\sup_{\mathbf{x} : \|\mathbf{x} - \mathbf{x}_i\|_2 \leq \gamma^d(s_i)} \|J(\mathbf{x})\|_2} \triangleq \gamma^{d_1}(s_i) \quad (34)$$
$$\geq \frac{o(s_i)}{\sup_{\mathbf{x} \in \mathrm{conv}(\mathcal{X})} \|J(\mathbf{x})\|_2} \triangleq \gamma^{d_2}(s_i) \quad (35)$$
$$\geq \frac{o(s_i)}{\prod_{\mathbf{W}_l \in \mathcal{W}} \|\mathbf{W}_l\|_2} \triangleq \gamma^{d_3}(s_i) \quad (36)$$
$$\geq \frac{o(s_i)}{\prod_{\mathbf{W}_l \in \mathcal{W}} \|\mathbf{W}_l\|_F} \triangleq \gamma^{d_4}(s_i), \quad (37)$$

where $\mathcal{W}$ is the set of all weight matrices of $f(\mathbf{x})$.

Proof: The proof appears in Appendix D.

Given the bounds on the classification margin we can specialize Corollary 1 to DNN classifiers.

Corollary 3. Assume that $\mathcal{X}$ is a (subset of a) $C_M$ regular $k$-dimensional manifold, where $N(\mathcal{X}; d, \rho) \leq \left( \frac{C_M}{\rho} \right)^k$. Assume also that the DNN classifier $g(\mathbf{x})$ achieves a lower bound to the classification margin $\gamma^{d_b}(s_i) > \gamma_b$ for some $b \in \{1, 2, 3, 4\}$, $\forall s_i \in S_m$, and take $\ell(g(\mathbf{x}_i), y_i)$ to be the 0-1 loss. Then for any $\delta > 0$, with probability at least $1 - \delta$,

$$\mathrm{GE}(g) \leq \sqrt{ \frac{\log(2) \cdot N_\mathcal{Y} \cdot 2^{k+1} \cdot (C_M)^k}{\gamma_b^k\, m} } + \sqrt{ \frac{2 \log(1/\delta)}{m} }. \quad (38)$$

Proof: The proof follows from Theorems 1, 2 and 4.

Corollary 3 suggests that the GE will be bounded by $C \frac{1}{\sqrt{m}} \gamma^{-k/2}$, where $C = \sqrt{\log(2) \cdot N_\mathcal{Y}\, 2^{k+1} (C_M)^k}$, provided that the classification margin bounds satisfy $\gamma^{d_b}(s_i) > \gamma$ for some $b \in \{1, 2, 3, 4\}$, $\forall s_i \in S_m$. We now leverage the classification margin bounds in Theorem 4 to construct constraint sets $\mathcal{W}_b = \{\mathbf{W}_l \in \mathcal{W} : \gamma^{d_b}(s_i) > \gamma\ \forall s_i\}$, $b \in \{1, 2, 3, 4\}$, such that $\mathcal{W} \in \mathcal{W}_b$ ensures that the GE is bounded by $C \frac{1}{\sqrt{m}} \gamma^{-k/2}$. Using (34)-(37) we obtain

$$\mathcal{W}_1 = \left\{ \mathbf{W}_l \in \mathcal{W} : \sup_{\mathbf{x} : \|\mathbf{x} - \mathbf{x}_i\|_2 \leq \gamma^d(s_i)} \|J(\mathbf{x})\|_2 < \gamma^{-1} \cdot o(s_i)\ \forall s_i = (\mathbf{x}_i, y_i) \right\}, \quad (39)$$
$$\mathcal{W}_2 = \left\{ \mathbf{W}_l \in \mathcal{W} : \sup_{\mathbf{x} \in \mathrm{conv}(\mathcal{X})} \|J(\mathbf{x})\|_2 < \gamma^{-1} \cdot o(s_i)\ \forall s_i = (\mathbf{x}_i, y_i) \right\}, \quad (40)$$
$$\mathcal{W}_3 = \left\{ \mathbf{W}_l \in \mathcal{W} : \prod_{\mathbf{W}_l \in \mathcal{W}} \|\mathbf{W}_l\|_2 < \gamma^{-1} \cdot o(s_i)\ \forall s_i = (\mathbf{x}_i, y_i) \right\}, \quad (41)$$
$$\mathcal{W}_4 = \left\{ \mathbf{W}_l \in \mathcal{W} : \prod_{\mathbf{W}_l \in \mathcal{W}} \|\mathbf{W}_l\|_F < \gamma^{-1} \cdot o(s_i)\ \forall s_i = (\mathbf{x}_i, y_i) \right\}. \quad (42)$$

Note that while we want to maximize the score $o(s_i)$, we also need to constrain the network's Jacobian matrix $J(\mathbf{x})$ (following $\mathcal{W}_1$ and $\mathcal{W}_2$) or the weight matrices $\mathbf{W}_l \in \mathcal{W}$ (following $\mathcal{W}_3$ and $\mathcal{W}_4$). This stands in line with the common training rationale of DNNs, in which we do not only aim at maximizing the score of the training samples to ensure a correct classification of the training set, but also apply a regularization that constrains the network parameters, where this combination eventually leads to a lower GE. The constraint sets in (39)-(42) impose different regularization techniques:

• The constraint $\sup_{\mathbf{x} : \|\mathbf{x} - \mathbf{x}_i\|_2 \leq \gamma^d(s_i)} \|J(\mathbf{x})\|_2 < \gamma^{-1} \cdot o(s_i)$ in (39) considers only the supremum of the spectral norm of the Jacobian matrix evaluated at the points within $N_i = \{\mathbf{x} : \|\mathbf{x} - \mathbf{x}_i\|_2 \leq \gamma^d(s_i)\}$, where $\gamma^d(s_i)$ is the classification margin of the training sample $s_i$ (see Definition 2). We cannot compute the margin $\gamma^d(s_i)$, but we can still obtain a rationale for regularization: as long as the spectral norm of the Jacobian matrix is bounded in the neighbourhood of a training sample $\mathbf{x}_i$ given by $N_i$, we will have the GE guarantees.

• The constraint on the Jacobian matrix $\sup_{\mathbf{x} \in \mathrm{conv}(\mathcal{X})} \|J(\mathbf{x})\|_2 < \gamma^{-1} \cdot o(s_i)$ in (40) is more restrictive, as it requires a bounded spectral norm for all samples $\mathbf{x}$ in the convex hull of the input space $\mathcal{X}$.

• The constraints in (41) and (42) are of a similar form, $\prod_{\mathbf{W}_l \in \mathcal{W}} \|\mathbf{W}_l\|_2 < \gamma^{-1} \cdot o(s_i)$ and $\prod_{\mathbf{W}_l \in \mathcal{W}} \|\mathbf{W}_l\|_F < \gamma^{-1} \cdot o(s_i)$, respectively. Note that weight decay, which aims at bounding the Frobenius norms of the weight matrices, might be used to satisfy the constraint in (42). However, note also that the bound based on the spectral norm in (41) is tighter than the one based on the Frobenius norm in (42). For example, take $\mathbf{W}_l \in \mathcal{W}$ to have orthonormal rows and be of dimension $M \times M$.
Then the constraint in (41), which is based on the spectral norm, is of the form $1 < \gamma^{-1} o(s_i)$, while the constraint in (42), which is based on the Frobenius norm, is $M^{L/2} < \gamma^{-1} o(s_i)$. In the former case we have a constraint on the score that is independent of the network width or depth. In the latter case the constraint on the output score is exponential in the network depth and polynomial in the network width. The difference is that the Frobenius norm does not take into account the correlations (angles) between the rows of the weight matrix $\mathbf{W}_l$, while the spectral norm does. Therefore, the bound based on the Frobenius norm corresponds to the worst case, when all the rows of $\mathbf{W}_l$ are aligned; in that case $\|\mathbf{W}_l\|_F = \|\mathbf{W}_l\|_2 = \sqrt{M}$. On the other hand, if the rows of $\mathbf{W}_l$ are orthonormal, then $\|\mathbf{W}_l\|_F = \sqrt{M}$ but $\|\mathbf{W}_l\|_2 = 1$.

Remark 1. To put the results into perspective, we compare our GE bounds to the GE bounds based on the Rademacher complexity in [33], which hold for DNNs with ReLUs. The work in [33] shows that if

$$\mathcal{W} \in \mathcal{W}_F = \left\{ \mathbf{W}_i \in \mathcal{W} : \prod_{i=1}^L \|\mathbf{W}_i\|_F < C_F \right\} \quad (43)$$

and the energy of the training samples is bounded, then

$$\mathrm{GE}(g) \lesssim \frac{1}{\sqrt{m}}\, 2^{L-1} C_F. \quad (44)$$

Although the bounds (38) and (44) are not directly comparable, since the bounds based on the robustness framework rely on an underlying assumption on the data (covering number), there is still a remarkable difference between them. The behaviour in (44) suggests that the GE grows exponentially with the network depth even if the product of the Frobenius norms of all the weight matrices is fixed, which is due to the term $2^L$. The bound in (34) and the constraint sets in (39)-(42), on the other hand, imply that the GE does not increase with the number of layers provided that the spectral/Frobenius norms of the weight matrices are bounded.
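The norm comparison above can be checked numerically. The following sketch (using NumPy; the matrix size $M = 64$ and depth $L = 5$ are arbitrary illustrative choices) contrasts the spectral and Frobenius norms of a generic matrix and of a matrix with orthonormal rows, and shows how the two products behave over depth:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 64, 5

# a generic weight matrix: the spectral norm never exceeds the Frobenius norm, cf. (30)
W = rng.standard_normal((M, M))
assert np.linalg.norm(W, 2) <= np.linalg.norm(W, 'fro')

# a matrix with orthonormal rows, obtained via a QR factorization
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
assert np.isclose(np.linalg.norm(Q, 2), 1.0)              # spectral norm is 1
assert np.isclose(np.linalg.norm(Q, 'fro'), np.sqrt(M))   # Frobenius norm is sqrt(M)

# over L such layers the spectral-norm product stays 1,
# while the Frobenius-norm product grows as M^(L/2)
assert np.isclose(np.linalg.norm(Q, 2) ** L, 1.0)
assert np.isclose(np.linalg.norm(Q, 'fro') ** L, M ** (L / 2))
```

The gap between the two products is exactly the gap between the constraints in (41) and (42) discussed above.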
Moreover, if we take the DNN to have weight matrices with orthonormal rows, then the GE behaves as $\frac{1}{\sqrt{m}} (C_M)^{k/2}$ (assuming $o(s_i) \geq 1$, $i = 1, \ldots, m$), and therefore relies only on the complexity of the underlying data manifold and not on the network depth. This provides a possible answer to the open question of [33] of whether depth-independent capacity control is possible in DNNs with ReLUs.

Remark 2. An important value of our bounds is that they provide an additional explanation for the success of state-of-the-art DNN training techniques such as batch normalization [23] and weight normalization [24]. Weight-normalized DNNs have weight matrices with normalized rows, i.e.

$$\mathbf{W}_l = \mathrm{diag}\!\left( \hat{\mathbf{W}}_l \hat{\mathbf{W}}_l^T \right)^{-\frac{1}{2}} \hat{\mathbf{W}}_l, \quad (45)$$

where $\mathrm{diag}(\cdot)$ denotes the diagonal part of the matrix. While the main motivation for this method is faster training, the authors also show empirically that such networks achieve good generalization. Note that for row-normalized weight matrices $\|\mathbf{W}_l\|_F = \sqrt{M_l}$, and therefore the bounds based on the Frobenius norm cannot explain the good generalization of such networks, as adding layers or making $\mathbf{W}_l$ larger will lead to a larger GE bound. However, our bound in (34) and the constraint sets in (39)-(41) show that a small Frobenius norm of the weight matrices is not crucial for a small GE. A supporting experiment is presented in Section VI-A2.

We also note that batch normalization also leads to row-normalized weight matrices in DNNs with ReLUs:^5

Theorem 5. Assume that the non-linear layers of a DNN with ReLUs are batch normalized as

$$\mathbf{z}_{l+1} = \left[ N\!\left( \{\mathbf{z}_l^i\}_{i=1}^m, \mathbf{W}_l \right) \hat{\mathbf{z}}_l \right]_\sigma, \quad \hat{\mathbf{z}}_l = \mathbf{W}_l \mathbf{z}_l, \quad (46)$$

where $\sigma$ denotes the ReLU non-linearity and

$$N\!\left( \{\mathbf{z}_i\}_{i=1}^m, \mathbf{W} \right) = \mathrm{diag}\!\left( \sum_{i=1}^m \mathbf{W} \mathbf{z}_i \mathbf{z}_i^T \mathbf{W}^T \right)^{-\frac{1}{2}} \quad (47)$$

is the normalization matrix. Then all the weight matrices are row normalized.
The exception is the weight matrix of the last layer, which is of the form $N(\{\mathbf{z}_{L-1}^i\}_{i=1}^m, \mathbf{W}_L)\, \mathbf{W}_L$.

Proof: The proof appears in Appendix E.

A. Jacobian Regularizer

The constraint set (39) suggests that we can regularize a DNN by bounding the norm of the network's JM for inputs close to $\mathbf{x}_i$. Therefore, we propose to penalize the norm of the network's JM evaluated at each training sample $\mathbf{x}_i$,

$$R_J(\mathcal{W}) = \frac{1}{m} \sum_{i=1}^m \|J(\mathbf{x}_i)\|_2^2. \quad (48)$$

The implementation of such a regularizer requires the computation of its gradients or subgradients. In this case the computation of the subgradient of the spectral norm requires the calculation of an SVD [44], which makes the proposed regularizer inefficient. To circumvent this, we propose a surrogate regularizer based on the Frobenius norm of the Jacobian matrix:

$$R_F(\mathcal{W}) = \frac{1}{m} \sum_{i=1}^m \|J(\mathbf{x}_i)\|_F^2. \quad (49)$$

^5 To simplify the derivation we omit the bias vectors and therefore also the centering applied by batch normalization. This does not affect the generality of the result. We also follow [43] and omit the batch normalization scaling, as it can be included in the weight matrix of the layer following the batch normalization. We also omit the regularization term and assume that the matrices are invertible.

Note that the Frobenius norm and the spectral norm are related as follows: $\frac{1}{\mathrm{rank}(J(\mathbf{x}_i))} \|J(\mathbf{x}_i)\|_F^2 \leq \|J(\mathbf{x}_i)\|_2^2 \leq \|J(\mathbf{x}_i)\|_F^2$, which justifies using the surrogate regularizer. We will refer to $R_F(\mathcal{W})$ as the Jacobian regularizer.

1) Computation of Gradients and Efficient Implementation: Note that the $k$-th row of $J(\mathbf{x}_i)$ corresponds to the gradient of $(f(\mathbf{x}))_k$ with respect to the input $\mathbf{x}$, evaluated at $\mathbf{x}_i$. It is denoted by $\mathbf{g}_k(\mathbf{x}_i) = \frac{d (f(\mathbf{x}))_k}{d \mathbf{x}} \big|_{\mathbf{x} = \mathbf{x}_i}$. Now we can write

$$R_F(\mathcal{W}) = \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^M \mathbf{g}_k(\mathbf{x}_i)\, \mathbf{g}_k(\mathbf{x}_i)^T. \quad (50)$$
As the regularizer will be minimized by a gradient descent algorithm, we need to compute its gradient with respect to the DNN parameters. First, we express $\mathbf{g}_k(\mathbf{x}_i)$ as

$$\mathbf{g}_k(\mathbf{x}_i) = \mathbf{g}_k^l(\mathbf{x}_i)\, \mathbf{W}_l\, J_{l-1}(\mathbf{x}_i), \quad (51)$$

where $\mathbf{g}_k^l(\mathbf{x}_i) = \frac{d (f(\mathbf{x}))_k}{d \hat{\mathbf{z}}_l} \big|_{\mathbf{x} = \mathbf{x}_i}$ is the gradient of $(f(\mathbf{x}))_k$ with respect to $\hat{\mathbf{z}}_l$ evaluated at the input $\mathbf{x}_i$, and $J_{l-1}(\mathbf{x}_i) = \frac{d \mathbf{z}_{l-1}}{d \mathbf{x}} \big|_{\mathbf{x} = \mathbf{x}_i}$ is the JM of the $(l-1)$-th layer output $\mathbf{z}_{l-1}$ evaluated at the input $\mathbf{x}_i$. The gradient of $\mathbf{g}_k(\mathbf{x}_i)\, \mathbf{g}_k(\mathbf{x}_i)^T$ with respect to $\mathbf{W}_l$ is then given as [45]

$$\nabla_{\mathbf{W}_l} \left( \mathbf{g}_k(\mathbf{x}_i)\, \mathbf{g}_k(\mathbf{x}_i)^T \right) = 2\, \mathbf{g}_k^l(\mathbf{x}_i)^T \mathbf{g}_k^l(\mathbf{x}_i)\, \mathbf{W}_l\, J_{l-1}(\mathbf{x}_i).$$

The computation of the gradient of the regularizer at layer $l$ requires the computation of the gradients $\mathbf{g}_k^l(\mathbf{x}_i)$, $k = 1, \ldots, M$, $i = 1, \ldots, m$, and the computation of the Jacobian matrices $J_{l-1}(\mathbf{x}_i)$, $i = 1, \ldots, m$. The computation of the gradient of a typical loss used for training DNNs usually involves the computation of $m$ gradients with computational complexity similar to that of $\mathbf{g}_k^l(\mathbf{x}_i)$. Therefore, the computation of the gradients required for an implementation of the Jacobian regularizer can be very expensive. To avoid excessive computational complexity we propose a simplified version of the regularizer (49), which we name the per-layer Jacobian regularizer. The per-layer Jacobian regularizer at layer $l$ is defined as

$$R_F^l(\mathbf{W}_l) = \frac{1}{m} \sum_{i=1}^m \tilde{\mathbf{g}}_{\pi(i)}^{l-1}(\mathbf{x}_i) \left( \tilde{\mathbf{g}}_{\pi(i)}^{l-1}(\mathbf{x}_i) \right)^T, \quad (52)$$

where $\tilde{\mathbf{g}}_{\pi(i)}^{l-1}(\mathbf{x}_i) = \frac{d (f(\mathbf{x}))_{\pi(i)}}{d \mathbf{z}_{l-1}} \big|_{\mathbf{x} = \mathbf{x}_i}$ and $\pi(i) \in \{1, \ldots, M\}$ is a random index. Compared to (49) we have made two simplifications. First, we have assumed that the input of layer $l$ is fixed. This way we do not need to compute the JM $J_{l-1}(\mathbf{x}_i)$ between the output of layer $l-1$ and the input.
Second, by choosing only one index $\pi(i)$ per training sample we have to compute only one additional gradient per training sample. This significantly reduces the computational complexity. The gradient of $\tilde{\mathbf{g}}_{\pi(i)}^{l-1}(\mathbf{x}_i) (\tilde{\mathbf{g}}_{\pi(i)}^{l-1}(\mathbf{x}_i))^T$ is simply

$$\nabla_{\mathbf{W}_l} \left( \tilde{\mathbf{g}}_{\pi(i)}^{l-1}(\mathbf{x}_i) \left( \tilde{\mathbf{g}}_{\pi(i)}^{l-1}(\mathbf{x}_i) \right)^T \right) = 2\, \mathbf{g}_{\pi(i)}^l(\mathbf{x}_i)^T \mathbf{g}_{\pi(i)}^l(\mathbf{x}_i)\, \mathbf{W}_l.$$

We demonstrate the effectiveness of these regularizers in Section VI.

V. DISCUSSION

In the preceding sections we analysed standard feed-forward DNNs and their classification margin measured in the Euclidean norm. We now briefly discuss how our results extend to other DNN architectures and different margin metrics.

A. Beyond Feed-Forward DNNs

There are various DNN architectures, such as Residual Networks (ResNets) [4], [46], Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks [47], and auto-encoders [48], that are used frequently in practice. It turns out that our analysis, which is based on the network's JM, can also be easily extended to such DNN architectures. In fact, the proposed framework encompasses all DNN architectures for which the JM can be computed. Below we compute the JM of a ResNet.

ResNets introduce shortcut connections between layers. In particular, let $\phi(\cdot, \theta_l)$ denote a concatenation of several non-linear layers (see (18)). The $l$-th block of a Residual Network is then given as

$$\mathbf{z}_l = \mathbf{z}_{l-1} + \phi(\mathbf{z}_{l-1}, \theta_l). \quad (53)$$

We denote by $J_l(\mathbf{z}_{l-1})$ the JM of $\phi(\mathbf{z}_{l-1}, \theta_l)$. Then the JM of the $l$-th block is

$$\frac{d \mathbf{z}_l}{d \mathbf{z}_{l-1}} = I + J_l(\mathbf{z}_{l-1}), \quad (54)$$

and the JM of a ResNet is of the form

$$J_{SM}(\mathbf{z}_{L-1}) \cdot \left( I + \sum_{l=1}^{L} J_l(\mathbf{z}_{l-1}) \prod_{i=1}^{l-1} \left( I + J_{l-i}(\mathbf{z}_{l-i-1}) \right) \right), \quad (55)$$

where $J_{SM}(\mathbf{z}_{L-1})$ denotes the JM of the softmax layer.
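Treating the block JMs as fixed matrices (which holds locally, for a fixed activation pattern), the structure of (54) can be checked numerically for two stacked residual blocks; a minimal sketch, with an arbitrary dimension chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
J1, J2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
I = np.eye(d)

# stacking two residual blocks multiplies their JMs, each of the form I + J_l as in (54)
J_net = (I + J2) @ (I + J1)

# this equals the sum of the JMs of every sub-network: the identity path,
# the two single-block paths, and the full two-block path
assert np.allclose(J_net, I + J1 + J2 + J2 @ J1)
```

Expanding the product in this way is exactly what yields the sum over sub-networks discussed next.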
In particular, the right element of the product in (55) can be expanded as

$$I + J_1(\mathbf{x}) + J_2(\mathbf{z}_1) + J_2(\mathbf{z}_1) J_1(\mathbf{x}) + J_3(\mathbf{z}_2) + J_3(\mathbf{z}_2) J_2(\mathbf{z}_1) J_1(\mathbf{x}) + J_3(\mathbf{z}_2) J_2(\mathbf{z}_1) + J_3(\mathbf{z}_2) J_1(\mathbf{x}) + \ldots$$

This is a sum of the JMs of all the possible sub-networks of a ResNet. In particular, there are $L$ elements of the sum consisting of only one 1-layer sub-network, and there is only one element of the sum consisting of the $L$-layer sub-network. This observation is consistent with the claims in [49], which state that ResNets resemble an ensemble of relatively shallow networks.

B. Beyond the Euclidean Metric

We can also consider the geodesic distance on a manifold as a measure of margin instead of the Euclidean distance. The geodesic distance can be more appropriate than the Euclidean distance, since it is a natural metric on the manifold. Moreover, the covering number of the manifold $\mathcal{X}$ may be smaller if we use a covering based on geodesic metric balls, which will lead to tighter GE bounds. We outline the approach below.

Assume that $\mathcal{X}$ is a Riemannian manifold and $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$. Take a continuous, piecewise continuously differentiable curve $c(t)$, $t \in [0, 1]$, such that $c(0) = \mathbf{x}$, $c(1) = \mathbf{x}'$ and $c(t) \in \mathcal{X}\ \forall t \in [0, 1]$. The set of all such curves $c(\cdot)$ is denoted by $\mathcal{C}$. Then the geodesic distance between $\mathbf{x}$ and $\mathbf{x}'$ is defined as

$$d_G(\mathbf{x}, \mathbf{x}') = \inf_{c(t) \in \mathcal{C}} \int_0^1 \left\| \frac{dc(t)}{dt} \right\|_2 dt. \quad (56)$$

Similarly as in Section III, we can show that the JM of a DNN is central to bounding the distance expansion between the signals at the DNN input and the signals at the DNN output.

Theorem 6. Take $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$, where $\mathcal{X}$ is a Riemannian manifold, and take $c(t)$, $t \in [0, 1]$, to be a continuous, piecewise continuously differentiable curve connecting $\mathbf{x}$ and $\mathbf{x}'$ such that $d_G(\mathbf{x}, \mathbf{x}') = \int_0^1 \left\| \frac{dc(t)}{dt} \right\|_2 dt$.
Then

$$\|f(\mathbf{x}') - f(\mathbf{x})\|_2 \leq \sup_{t \in [0, 1]} \|J(c(t))\|_2\; d_G(\mathbf{x}', \mathbf{x}). \quad (57)$$

Proof: The proof appears in Appendix F.

Note that we have established a relationship between the Euclidean distance of two points in the output space and the corresponding geodesic distance in the input space. This is important because it implies that promoting a large Euclidean distance between points at the output can lead to a large geodesic distance between the points in the input space. Moreover, the ratio between $\|f(\mathbf{x}') - f(\mathbf{x})\|_2$ and $d_G(\mathbf{x}, \mathbf{x}')$ is upper bounded by the maximum value of the spectral norm of the network's JM evaluated on the curve $c(t)$. This result is analogous to the results of Theorem 3 and Corollary 2. It also implies that regularizing the network's JM as proposed in Section IV is beneficial also in the case when the classification margin is not measured in the Euclidean metric.

Finally, note that in practice the training data may not be balanced. The provided GE bounds are still valid in such cases. However, the classification error may not be the best measure of performance in such cases, as it is dominated by the classification error of the class with the highest prior probability. Therefore, alternative performance measures need to be considered. We leave a detailed study of training DNNs with unbalanced training sets for possible future work.

Fig. 3. Classification accuracy of DNNs trained with the Jacobian regularization (solid lines) and the weight decay (dashed lines). Different numbers of training samples are used: 5000 (red), 20000 (blue) and 50000 (black). Panel (a) shows MNIST and panel (b) shows CIFAR-10.

VI. EXPERIMENTS

We now validate the theory with a series of experiments on the MNIST [50], CIFAR-10 [51], LaRED [52] and ImageNet (ILSVRC2012) [53] datasets.
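Before turning to the individual experiments, it may help to make the regularizer concrete. The following sketch implements $R_F$ from (49) for a hypothetical two-layer ReLU network $f(\mathbf{x}) = \mathbf{W}_2\,\mathrm{relu}(\mathbf{W}_1 \mathbf{x})$, whose JM is the product of the layer JMs with the point-wise ReLU contributing a 0/1 diagonal matrix as in (27); all dimensions and data are arbitrary illustrative choices:

```python
import numpy as np

def jacobian(x, W1, W2):
    # JM of f(x) = W2 relu(W1 x): product of the layer JMs, with the
    # ReLU contributing a 0/1 diagonal matrix, cf. (27)-(28)
    D = np.diag((W1 @ x > 0).astype(float))
    return W2 @ D @ W1

def jacobian_regularizer(X, W1, W2):
    # R_F(W) = (1/m) sum_i ||J(x_i)||_F^2, the surrogate regularizer in (49)
    return np.mean([np.linalg.norm(jacobian(x, W1, W2), 'fro') ** 2 for x in X])

rng = np.random.default_rng(2)
W1, W2 = rng.standard_normal((6, 4)), rng.standard_normal((3, 6))
X = rng.standard_normal((10, 4))

# the surrogate upper-bounds the spectral-norm regularizer (48),
# per the norm relation stated after (49)
R_frob = jacobian_regularizer(X, W1, W2)
R_spec = np.mean([np.linalg.norm(jacobian(x, W1, W2), 2) ** 2 for x in X])
assert R_spec <= R_frob
```

In the experiments below the same quantity is computed by automatic differentiation rather than in closed form, but the penalized object is the same.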
The Jacobian regularizer is applied to various DNN architectures, such as DNNs with fully connected layers, convolutional DNNs and ResNet [4]. We use ReLUs in all considered DNNs, as this is currently the most popular non-linearity.

A. Fully Connected DNNs

In this section we compare the performance of fully connected DNNs regularized with the Jacobian regularization or with the weight decay. We then analyse the behaviour of the JM of fully connected DNNs of various depths and widths.

1) Comparison of Jacobian Regularization and Weight Decay: First, we compare standard DNNs with fully connected layers trained with the weight decay and with the Jacobian regularization (49) on the MNIST and CIFAR-10 datasets. Different numbers of training samples are used (5000, 20000, 50000). We consider DNNs with 2, 3 and 4 fully connected layers, where all layers except the last one have dimension equal to the input signal dimension, which is 784 in the case of MNIST and 3072 in the case of CIFAR-10. The last layer is always the softmax layer and the objective is the CCE loss. The networks were trained using stochastic gradient descent (SGD) with momentum, which was set to 0.9. The batch size was set to 128 and the learning rate was set to 0.01 and reduced by a factor of 10 after every 40 epochs. The networks were trained for 120 epochs in total. The weight decay and the Jacobian regularization factors were chosen on a separate validation set. The experiments were repeated with the same regularization parameters on 5 random draws of training sets and weight matrix initializations. Classification accuracies averaged over the different experimental runs are shown in Fig. 3. We observe that the proposed Jacobian regularization always outperforms the weight decay. This validates our theoretical results in Section IV, which predict that the Jacobian matrix is crucial for the control of (the bound on) the GE.
Interestingly, in the case of MNIST, a 4-layer DNN trained with 20000 training samples and the Jacobian regularization (solid blue line in Fig. 3(a)) performs on par with a DNN trained with 50000 training samples and the weight decay (dashed black line in Fig. 3(a)), which means that the Jacobian regularization can lead to the same performance with significantly fewer training samples.

2) Analysis of Weight Normalized Deep Neural Networks: Next, we explore weight normalized DNNs, which are described in Section IV. We use the MNIST dataset and train DNNs with different numbers of fully connected layers ($L = 2, 3, 4, 5$) and different sizes of weight matrices ($M_l = 784, 2 \cdot 784, 3 \cdot 784, 4 \cdot 784, 5 \cdot 784, 6 \cdot 784$, $l = 1, \ldots, L-1$). The last layer is always the softmax layer and the objective is the CCE loss. The networks were trained using stochastic gradient descent (SGD) with momentum, which was set to 0.9. The batch size was set to 128 and the learning rate was set to 0.1 and reduced by a factor of 10 after every 40 epochs. The networks were trained for 120 epochs in total. All experiments are repeated 5 times with different random draws of the training set and different random weight initializations. We did not employ any additional regularization, as our goal here is to explore the effects of the weight normalization on the DNN behaviour. We always use 5000 training samples.

The classification accuracies are shown in Fig. 4(a) and the smallest classification score obtained on the training set is shown in Fig. 4(b). We have observed for all configurations that the training accuracies were 100% (the only exception is the case $L = 2$, $M_l = 784$, where the training accuracy was 99.6%). Therefore, the (testing set) classification accuracies increasing with the network depth and the weight matrix size directly imply that the GE is smaller for deeper and wider DNNs. Note also that the score increases with the network depth and width.
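The score tracked in these experiments is the quantity defined in (33); a minimal computation, with a hypothetical three-class output vector chosen for illustration:

```python
import numpy as np

def score(f_x, y):
    # o(s_i) = min_{j != y_i} sqrt(2) (delta_{y_i} - delta_j)^T f(x_i), as in (33)
    n_classes = len(f_x)
    deltas = np.eye(n_classes)
    return min(np.sqrt(2) * (deltas[y] - deltas[j]) @ f_x
               for j in range(n_classes) if j != y)

f_x = np.array([0.1, 0.7, 0.2])   # toy network output for a sample with label 1
assert score(f_x, 1) > 0          # correctly separated at the network output
assert score(f_x, 0) < 0          # a wrong label yields a negative score
```

The score thus measures, up to the $\sqrt{2}$ factor, how far the correct output coordinate is above the runner-up coordinate.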
The increase is most obvious for the 2- and 3-layer DNNs, whereas for the 3- and 4-layer DNNs the score is close to $\sqrt{2}$ for all network widths. Since the DNNs are weight normalized, the Frobenius norms of the weight matrices are equal to the square root of the weight matrix dimension, and the product of the Frobenius norms of the weight matrices grows with the network depth and the weight matrix size. The increase of the score with the network depth and width does not offset the product of the Frobenius norms, and clearly, the bound in (38) based on the margin bound in (37) and the bound in (44), which leverage the Frobenius norms of the weight matrices, predict that the GE will increase with the network depth and weight matrix size in this scenario. Therefore, the experiment indicates that these bounds are too pessimistic.

We have also inspected the spectral norms of the weight matrices of the trained networks. In all cases the spectral norms were greater than one. We can argue that the bound in (38) based on the margin bound in (36) predicts that the GE will increase with the network depth, as the product of the spectral norms grows with the network depth in a similar way as in the previous paragraph. We note, however, that the spectral norms of the weight matrices are much smaller than the Frobenius norms of the weight matrices.

Finally, we look for a possible explanation for the success of the weight normalization in the bounds in (38) based on the margin bounds in (34) and (35), which are a function of the JM. The largest value of the spectral norm of the network's JM evaluated on the training set is shown in Fig. 4(c) and the largest value of the spectral norm of the network's JM evaluated on the testing set is shown in Fig. 4(d). We can observe an interesting phenomenon. The maximum value of the JM's spectral norm on the training set decreases with the network depth and width.
On the other hand, the maximum value of the JM's spectral norm on the testing set increases with the network depth (and slightly with the network width). From the perspective of the constraint sets in (39) and (40), we note that in the case of the latter we have to take into account the worst-case spectral norm of the JM for inputs in $\mathrm{conv}(\mathcal{X})$. The maximum value of the spectral norm on the testing set indicates that this value increases with the network depth, and implies that the bound based on (35) is still loose. On the other hand, the bound in (34) implies that we have to consider the JM in the neighbourhood of the training samples. As an approximation, we can take the spectral norms of the JMs evaluated at the training set. As shown in Fig. 4(c), these values decrease with the network depth and width. We argue that this provides a reasonable explanation for the good generalization of deeper and wider weight normalized DNNs.

B. Convolutional DNNs

In this section we compare the performance of convolutional DNNs regularized with the Jacobian regularizer or with the weight decay. We also show that the Jacobian regularization can be applied to batch normalized DNNs. We use the standard MNIST and CIFAR-10 datasets and the LaRED dataset, which is briefly described below.

The LaRED dataset contains depth images of 81 distinct hand gestures performed by 10 subjects, with approximately 300 images of each gesture per subject. We extracted the depth images of the hands using the masks provided in [52] and resized the images to 32 × 32. The images of the first 6 subjects were used to create non-overlapping training and testing sets. In addition, we also constructed a testing set composed of the images of the last 4 subjects in the dataset, in order to test generalization across different subjects. The goal is the classification of gestures based on the depth image.
1) Comparison of Jacobian Regularization and Weight Decay: We use a 4-layer convolutional DNN with the following architecture: (32, 5, 5)-conv, (2, 2)-max-pool, (32, 5, 5)-conv, (2, 2)-max-pool, followed by a softmax layer, where (k, u, v)-conv denotes a convolutional layer with k filters of size u × v and (p, p)-max-pool denotes max-pooling with pooling regions of size p × p. The training procedure follows the one described in the previous paragraphs. The results are reported in Table II.

Fig. 4. Weight normalized DNNs with $L = 2, 3, 4, 5$ layers and different sizes of weight matrices (layer width). Plot (a) shows the classification accuracy, plot (b) shows the smallest score of the training samples, plot (c) shows the largest spectral norm of the network's JM evaluated on the training set and plot (d) shows the largest spectral norm of the network's JM evaluated on the testing set.

TABLE II
CLASSIFICATION ACC. [%] OF CONVOLUTIONAL DNN ON MNIST AND LARED.

              MNIST                          LaRED (same subject)           LaRED (different subject)
# train   Weight dec.   Jac. reg.    # train   Weight dec.   Jac. reg.    # train   Weight dec.   Jac. reg.
 1000        94.00        96.03       2000        61.40        63.56       2000        31.53        32.62
 5000        97.59        98.20       5000        76.59        79.14       5000        38.11        39.62
20000        98.60        99.00      10000        87.01        88.24      10000        41.18        42.85
50000        99.10        99.35      50000        97.18        97.54      50000        45.12        46.78

We observe that training with the Jacobian regularization outperforms the weight decay in all cases. This is most obvious at the smaller training set sizes. For example, on the MNIST dataset, the DNN trained using 1000 training samples and regularized with the weight decay achieves a classification accuracy of 94.00%, while the DNN trained with the Jacobian regularization achieves a classification accuracy of 96.03%. Similarly, on the LaRED dataset the Jacobian regularization outperforms the weight decay, with the difference most obvious at the smallest number of training samples. Note also that the generalization of the network to the subjects outside the training set is not very good; i.e., using 50000 training samples, the classification accuracy on the testing set containing the same subjects is higher than 97%, whereas the classification accuracy on the testing set containing different subjects is only 46%. Nevertheless, the Jacobian regularization outperforms the weight decay also on this testing set, by a small margin.

TABLE III
CLASSIFICATION ACC. [%] OF CONVOLUTIONAL DNN ON CIFAR-10.

# train samples   Batch norm.   Batch norm. + Jac. reg.
 2500                60.86              66.15
10000                76.35              80.57
50000                87.44              88.95

2) Batch Normalization and Jacobian Regularization: Now we show that the Jacobian regularization (49) can also be applied to a batch normalized DNN. Note that we have shown in Section IV that batch normalization has the effect of normalizing the rows of the weight matrices. We use the CIFAR-10 dataset and the all-convolutional DNN proposed in [54] (All-CNN-C), with 9 convolutional layers, an average pooling layer and a softmax layer.
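The row-normalizing effect of the batch normalization in (46)-(47) can be checked numerically. In the following sketch the batch size, the dimensions, and the interpretation of "row normalized" as each row of the effective weight matrix having unit second moment over the batch are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
m, d, M = 100, 8, 6
Z = rng.standard_normal((m, d))        # a batch of layer inputs z_i
W = rng.standard_normal((M, d))        # the layer's weight matrix

# normalization matrix N({z_i}, W) = diag(sum_i W z_i z_i^T W^T)^(-1/2), cf. (47)
S = sum(np.outer(W @ z, W @ z) for z in Z)
N = np.diag(1.0 / np.sqrt(np.diag(S)))

# effective weight matrix seen by the following ReLU non-linearity
W_eff = N @ W

# each row of W_eff has unit second moment over the batch, i.e. the rows
# are normalized with respect to the batch statistics
energies = np.array([sum(float(w @ z) ** 2 for z in Z) for w in W_eff])
assert np.allclose(energies, 1.0)
```

This is the sense in which batch normalization controls the row norms that enter the margin bounds of Section IV.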
All the convolutional layers are batch normalized and the softmax layer is weight normalized. The networks were trained using stochastic gradient descent (SGD) with momentum, which was set to 0.9. The batch size was set to 64 and the learning rate was set to 0.1 and reduced by a factor of 10 after every 25 epochs. The networks were trained for 75 epochs in total. The classification accuracy results are presented in Table III for different sizes of training sets (2500, 10000, 50000). We can observe that the Jacobian regularization also leads to a smaller GE in this case.

C. Residual Networks

Now we demonstrate that the Jacobian regularizer is also effective when applied to ResNets. We use the CIFAR-10 and ImageNet datasets, and we use the per-layer Jacobian regularization (52) for the experiments in this section.

1) CIFAR-10: The Wide ResNet architecture proposed in [35], which follows [46] but proposes wider and shallower networks that lead to the same or better performance than deeper and thinner networks, is used here. In particular, we use the ResNet with 22 layers of width 5. We follow the data normalization process of [35]. We also follow the training procedure of [35], except for the learning rate, and use the learning rate sequence (0.01, 5), (0.05, 20), (0.005, 40), (0.0005, 40), (0.00005, 20), where the first number in parentheses corresponds to the learning rate and the second number corresponds to the number of epochs. We train the ResNet on small training sets (2500 and 10000 training samples) without augmentation and on the full training set with the data augmentation as in [35]. The regularization factors were set to 1 and 0.1 for the smaller training sets (2500 and 10000) and the full augmented training set, respectively.

TABLE IV
CLASSIFICATION ACC. [%] OF RESNET ON CIFAR-10.

# train samples   ResNet   ResNet + Jac. reg.
 2500              55.69         62.79
10000              71.79         78.70
50000 + aug.       93.34         94.32
The results are presented in Table IV. In all cases the ResNet with the Jacobian regularization outperforms the standard ResNet. The effect of the regularization is strongest with the smaller number of training samples, as expected.

2) ImageNet: We use the 18-layer ResNet [4] with identity connections [46]. The training procedure follows [4] with the learning rate sequence (0.1, 30), (0.01, 30), (0.001, 30). The Jacobian regularization factor is set to 1. The images in the dataset are resized to 128 × 128. We run an experiment without data augmentation and with data augmentation following [1], which includes random cropping of images of size 112 × 112 from the original image and color augmentation. The classification accuracies during training are shown in Fig. 5 and the final results are reported in Table V.

We first focus on training without data augmentation. The ResNet trained using the Jacobian regularization has a much smaller GE (23.83%) compared to the baseline ResNet (61.53%). This again demonstrates that the Jacobian regularization decreases the GE, as our theory predicts. Note that the smaller GE of the Jacobian-regularized ResNet partially transfers to a higher classification accuracy on the testing set. However, in practice DNNs are often trained with data augmentation. In this case the GE of the baseline ResNet is much lower (13.14%) and is very close to the GE of the ResNet with the Jacobian regularization (12.03%). It is clear that data augmentation reduces the need for strong regularization. Nevertheless, note that the ResNet trained with the Jacobian regularization achieves a slightly higher testing set accuracy (47.51%) compared to the baseline ResNet (46.75%).

TABLE V
CLASSIFICATION ACC. [%] AND GE [%] OF RESNET ON IMAGENET

Setup                         Train   Test    GE
Baseline                      89.82   28.29   61.53
Baseline + Jac. reg.          59.52   35.69   23.83
Baseline + aug.               59.89   46.75   13.14
Baseline + aug. + Jac. reg.   59.54   47.51   12.03

Fig. 5. Training set (dashed) and testing set (solid) classification accuracies during training. Blue curves correspond to the ResNet with Jacobian regularization and red curves correspond to the baseline ResNet. Top-1 and top-5 classification accuracies are reported for training without data augmentation (a, b) and for training with data augmentation (c, d).

D. Computational Time

Finally, we measure how the use of the Jacobian regularization affects the training time of DNNs. We have implemented the DNNs in Theano [55], which includes automatic differentiation and computation graph optimization. The experiments are run on a Titan X GPU. The average computational times per batch for the convolutional DNN on the MNIST dataset in Section VI-B1 and for the ResNet on the ImageNet dataset in Section VI-C2 are reported in Table VI. Note that in the case of MNIST the regularizer in (49) is used, and in the case of ImageNet the per-layer regularizer in (52) is used. These results are also representative of the other datasets and network architectures.

TABLE VI
AVERAGE COMPUTATION TIME [s/BATCH]

Experiment               no reg.   Jac. reg.   Increase factor
MNIST (Sec. VI-B1)       0.003     0.030       10.00
ImageNet (Sec. VI-C2)    0.120     0.190        1.58

We can observe that using the Jacobian regularizer in (49) introduces a considerable additional computational cost. This may be acceptable when the number of training samples is small and training time is not a major constraint.
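The cost gap between the two regularizers has a simple source: evaluating the full Jacobian requires extra passes through the network (one per input or output dimension, depending on the differentiation mode), while the per-layer penalty reuses quantities already computed in the forward pass. A hypothetical stand-in network makes the pass count explicit (all names below are ours, not the paper's):

```python
# Hypothetical illustration of the cost of a full Jacobian: a finite-difference
# estimate needs one extra forward pass per input dimension, on top of the
# baseline forward pass.

calls = {"n": 0}

def forward(x):
    calls["n"] += 1
    return [x[0] + 2 * x[1], x[0] - x[1]]   # stand-in "network"

def fd_jacobian(x, eps=1e-6):
    """Finite-difference Jacobian; costs 1 + len(x) forward passes."""
    f0 = forward(x)
    cols = []
    for j in range(len(x)):
        xp = list(x)
        xp[j] += eps
        fj = forward(xp)
        cols.append([(a - b) / eps for a, b in zip(fj, f0)])
    return [list(row) for row in zip(*cols)]  # shape (out_dim, in_dim)

J = fd_jacobian([1.0, 2.0])
print(calls["n"])  # prints 3, i.e., 1 + in_dim forward passes per sample
```

In contrast, the per-layer penalty of (52) is computed from the weights and activation masks of the single forward pass that training performs anyway, which is consistent with the much smaller overhead in Table VI.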
On the other hand, the per-layer Jacobian regularizer in (52) has a much smaller cost. As shown in the experiments, this regularizer is still effective and leads to only a 58% increase in computation time on the ImageNet dataset. Due to its efficiency, the per-layer Jacobian regularizer may be more appropriate for large-scale experiments where computational time is important.

VII. CONCLUSION

This paper studies the GE of DNNs based on their classification margin. In particular, our bounds express the generalization error as a function of the classification margin, which is bounded in terms of the achieved separation between the training samples at the network output and the network's JM. One of the hallmarks of our bounds is that they characterize the behaviour of the generalization error more tightly than other bounds in the literature: our bounds predict that the generalization error of deep neural networks can be independent of their depth and size, whereas other bounds imply a generalization error that is exponential in the network width or size. Our bounds also suggest new regularization strategies, such as the regularization of the network's Jacobian matrix, which can be applied on top of other modern DNN training strategies such as weight normalization and batch normalization, where the standard weight decay cannot be applied. These regularization strategies are especially effective in the limited training data regime in comparison to other approaches, with a moderate increase in computational complexity.

APPENDIX

A. Proof of Theorem 3

We first note that the line between $x$ and $x'$ is given by $x + t(x' - x)$, $t \in [0, 1]$. We define the function $F(t) = f(x + t(x' - x))$ and observe that

$$\frac{dF(t)}{dt} = J(x + t(x' - x))(x' - x).$$
By the generalized fundamental theorem of calculus (equivalently, the Lebesgue differentiation theorem) we write

$$f(x') - f(x) = F(1) - F(0) = \int_0^1 \frac{dF(t)}{dt}\,dt = \left( \int_0^1 J(x + t(x' - x))\,dt \right)(x' - x). \quad (58)$$

This concludes the proof.

B. Proof of Corollary 2

First note that $\| J_{x,x'}(x' - x) \|_2 \le \| J_{x,x'} \|_2 \| x' - x \|_2$ and that $J_{x,x'}$ is an integral of $J(x + t(x' - x))$. In addition, notice that we may always apply the following upper bound:

$$\| J_{x,x'} \|_2 \le \sup_{x, x' \in \mathcal{X},\, t \in [0,1]} \| J(x + t(x' - x)) \|_2. \quad (59)$$

Since $x + t(x' - x) \in \mathrm{conv}(\mathcal{X})$ for all $t \in [0, 1]$, we get (24).

C. Proof of Lemma 1

In all proofs we leverage the fact that for any two matrices $A$, $B$ of appropriate dimensions it holds that $\| AB \|_2 \le \| A \|_2 \| B \|_2$. We also leverage the bound $\| A \|_2 \le \| A \|_F$.

We start with the proof of statement 1). For the non-linear layer (18), we note that the JM is a product of a diagonal matrix (27) and the weight matrix $W_l$. For all the considered non-linearities the diagonal elements of (27) are bounded by 1 (see the derivatives in Table I), which implies that the spectral norm of this diagonal matrix is bounded by 1. Therefore the spectral norm of the JM is upper bounded by $\| W_l \|_2$. The proof for the linear layer is trivial. In the case of the softmax layer (17) we have to show that the spectral norm of the softmax JM $-\zeta(\hat z)\zeta(\hat z)^T + \mathrm{diag}(\zeta(\hat z))$ is bounded by 1. We use the Gershgorin disc theorem, which states that the eigenvalues of $-\zeta(\hat z)\zeta(\hat z)^T + \mathrm{diag}(\zeta(\hat z))$ are bounded by

$$\max_i \; (\zeta(\hat z))_i (1 - (\zeta(\hat z))_i) + (\zeta(\hat z))_i \sum_{j \ne i} (\zeta(\hat z))_j. \quad (60)$$

Noticing that $\sum_{j \ne i} (\zeta(\hat z))_j \le 1$ leads to the upper bound

$$\max_i \; (\zeta(\hat z))_i (2 - (\zeta(\hat z))_i). \quad (61)$$

Since $(\zeta(\hat z))_i \in [0, 1]$, it is trivial to show that (61) is upper bounded by 1.

The proof of statement 2) is straightforward.
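The Gershgorin bound for the softmax JM above can also be checked numerically. The following is our own sketch (all helpers hypothetical): form $-\zeta(\hat z)\zeta(\hat z)^T + \mathrm{diag}(\zeta(\hat z))$ for an arbitrary logit vector and estimate its spectral norm by power iteration; since the matrix is symmetric, power iteration approximates $\|\cdot\|_2$, and the estimate stays below 1 as the proof requires.

```python
# Hypothetical numerical check of Lemma 1: the spectral norm of the
# softmax JM, -zeta zeta^T + diag(zeta), is bounded by 1.
import math, random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def softmax_jacobian(z):
    p = softmax(z)
    n = len(p)
    return [[(p[i] * (1.0 - p[i]) if i == j else -p[i] * p[j])
             for j in range(n)] for i in range(n)]

def spectral_norm(M, iters=200):
    """Power iteration; M is symmetric PSD, so this approximates ||M||_2."""
    random.seed(0)
    v = [random.random() for _ in M]
    for _ in range(iters):
        w = [sum(m * x for m, x in zip(row, v)) for row in M]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return norm

z = [2.0, -1.0, 0.5, 0.0]  # arbitrary logits
print(spectral_norm(softmax_jacobian(z)) <= 1.0)  # prints True
```

The same check can be repeated for any logit vector; the bound of 1 is what allows the softmax layer to be dropped from the product of layer-wise spectral norms.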
Because the pooling regions are non-overlapping, it is straightforward to verify that the rows of all the defined pooling operators $P^l(z^{l-1})$ are orthonormal. Therefore, the spectral norm of the JM is equal to 1.

D. Proof of Theorem 4

Throughout the proof we use the notation $o(s_i) = o(x_i, y_i)$ and $v_{ij} = \sqrt{2}\,(\delta_i - \delta_j)$.

We start by proving the inequality in (34). Assume that the classification margin $\gamma^d(s_i)$ of the training sample $(x_i, y_i)$ is given, and take $j^\star = \arg\min_{j \ne y_i} v_{y_i j}^T f(x_i)$. We then take a point $x^\star$ that lies on the decision boundary between $y_i$ and $j^\star$, so that $o(x^\star, y_i) = 0$. Then

$$o(x_i, y_i) = o(x_i, y_i) - o(x^\star, y_i) = v_{y_i j^\star}^T (f(x_i) - f(x^\star)) = v_{y_i j^\star}^T J_{x_i, x^\star} (x_i - x^\star) \le \| J_{x_i, x^\star} \|_2 \| x_i - x^\star \|_2.$$

Note that by the choice of $x^\star$, $\| x_i - x^\star \|_2 = \gamma^d(s_i)$, and similarly $\| J_{x_i, x^\star} \|_2 \le \sup_{x : \| x - x_i \|_2 \le \gamma^d(s_i)} \| J(x) \|_2$. Therefore, we can write

$$o(s_i) \le \sup_{x : \| x - x_i \|_2 \le \gamma^d(s_i)} \| J(x) \|_2 \; \gamma^d(s_i), \quad (62)$$

which leads to (34).

Next, we prove (35). Recall the definition of the classification margin in (10):

$$\gamma^d(s_i) = \sup \{ a : \| x_i - x \|_2 \le a \implies g(x) = y_i \; \forall x \} = \sup \{ a : \| x_i - x \|_2 \le a \implies o(x, y_i) > 0 \; \forall x \},$$

where we leverage the definition in (33). We observe that $o(x, y_i) > 0 \iff \min_{j \ne y_i} v_{y_i j}^T f(x) > 0$ and

$$\min_{j \ne y_i} v_{y_i j}^T f(x) = \min_{j \ne y_i} \left( v_{y_i j}^T f(x_i) + v_{y_i j}^T (f(x) - f(x_i)) \right).$$

Note that

$$\min_{j \ne y_i} \left( v_{y_i j}^T f(x_i) + v_{y_i j}^T (f(x) - f(x_i)) \right) \ge \min_{j \ne y_i} v_{y_i j}^T f(x_i) + \min_{j \ne y_i} v_{y_i j}^T (f(x) - f(x_i)) = o(x_i, y_i) + \min_{j \ne y_i} v_{y_i j}^T (f(x) - f(x_i)). \quad (63)–(64)$$

Therefore,

$$o(x_i, y_i) + \min_{j \ne y_i} v_{y_i j}^T (f(x) - f(x_i)) > 0 \implies o(x, y_i) > 0.$$
This leads to the following bound on the classification margin:

$$\gamma^d(s_i) \ge \sup \{ a : \| x_i - x \|_2 \le a \implies o(x_i, y_i) + \min_{j \ne y_i} v_{y_i j}^T (f(x) - f(x_i)) > 0 \; \forall x \}.$$

Note now that

$$o(x_i, y_i) + \min_{j \ne y_i} v_{y_i j}^T (f(x) - f(x_i)) > 0 \iff o(x_i, y_i) - \max_{j \ne y_i} v_{y_i j}^T (f(x_i) - f(x)) > 0 \iff o(x_i, y_i) > \max_{j \ne y_i} v_{y_i j}^T (f(x_i) - f(x)). \quad (65)–(67)$$

Moreover,

$$\max_{j \ne y_i} v_{y_i j}^T (f(x_i) - f(x)) \le \sup_{x \in \mathrm{conv}(\mathcal{X})} \| J(x) \|_2 \, \| x_i - x \|_2,$$

where we have leveraged the fact that $\| v_{ij} \|_2 = 1$ and the inequality (24) in Corollary 2. We may therefore write

$$\gamma^d(s_i) \ge \sup \{ a : \| x_i - x \|_2 \le a \implies o(x_i, y_i) > \sup_{x \in \mathrm{conv}(\mathcal{X})} \| J(x) \|_2 \, \| x_i - x \|_2 \; \forall x \}.$$

The value of $a$ that attains the supremum is easily obtained, and we get

$$\gamma^d(s_i) \ge \frac{o(x_i, y_i)}{\sup_{x \in \mathrm{conv}(\mathcal{X})} \| J(x) \|_2}, \quad (68)$$

which proves (35). The bounds in (36) and (37) follow from the bounds provided in Lemma 1 and the fact that the spectral norm of a matrix product is upper bounded by the product of the spectral norms. This concludes the proof.

E. Proof of Theorem 5

We denote by $W_l^N$ the row-normalized matrix obtained from $W_l$ (in the same way as in (45)). By noting that the ReLU and diagonal non-negative matrices commute, it is straightforward to verify that

$$[ N(\{ z_i^l \}_{i=1}^m, W_l) \, W_l z^l ]_\sigma = N(\{ z_i^l \}_{i=1}^m, W_l^N) \, [ W_l^N z^l ]_\sigma.$$

Note now that we can consider $N(\{ z_i^l \}_{i=1}^m, W_l^N)$ as part of the weight matrix $W_{l+1}$. Therefore, we can conclude that layer $l$ has a row-normalized weight matrix. When batch normalization is applied to all layers, all the weight matrices will be row normalized. The exception is the weight matrix of the last layer, which will be of the form $N(\{ z_i^{L-1} \}_{i=1}^m, W_L) \, W_L$.

F.
Proof of Theorem 6

We begin by noting that $f(x') - f(x) = f(c(1)) - f(c(0))$ and

$$f(c(1)) - f(c(0)) = \int_0^1 \frac{df(c(t))}{dt}\,dt = \int_0^1 \frac{df(c(t))}{dc(t)} \frac{dc(t)}{dt}\,dt,$$

where the first equality follows from the generalized fundamental theorem of calculus, following the idea presented in the proof of Theorem 3, and the second equality follows from the chain rule of differentiation. Finally, we note that $\frac{df(c(t))}{dc(t)} = J(c(t))$ and that the norm of an integral is always smaller than or equal to the integral of the norm, and obtain

$$\| f(x') - f(x) \|_2 = \left\| \int_0^1 J(c(t)) \frac{dc(t)}{dt}\,dt \right\|_2 \le \int_0^1 \| J(c(t)) \|_2 \left\| \frac{dc(t)}{dt} \right\|_2 dt \le \sup_{t \in [0,1]} \| J(c(t)) \|_2 \int_0^1 \left\| \frac{dc(t)}{dt} \right\|_2 dt = \sup_{t \in [0,1]} \| J(c(t)) \|_2 \; d_G(x, x'), \quad (69)$$

where we have used that $\int_0^1 \| \frac{dc(t)}{dt} \|_2\,dt = d_G(x, x')$.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012.
[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Oct. 2012.
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Dec. 2016.
[5] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814, 2010.
[6] J. Bruna, A. Szlam, and Y.
LeCun, "Learning stable group invariant representations with convolutional networks," International Conference on Learning Representations (ICLR), 2013.
[7] Y.-L. Boureau, J. Ponce, and Y. LeCun, "A theoretical analysis of feature pooling in visual recognition," Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 111–118, 2010.
[8] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[9] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[10] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio, "On the number of linear regions of deep neural networks," Advances in Neural Information Processing Systems (NIPS), pp. 2924–2932, 2014.
[11] N. Cohen, O. Sharir, and A. Shashua, "On the expressive power of deep learning: a tensor analysis," 29th Annual Conference on Learning Theory (COLT), pp. 698–728, 2016.
[12] M. Telgarsky, "Benefits of depth in neural networks," 29th Annual Conference on Learning Theory (COLT), pp. 1517–1539, 2016.
[13] S. Mallat, "Group invariant scattering," Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
[14] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, Mar. 2012.
[15] T. Wiatowski and H. Bölcskei, "A mathematical theory of deep convolutional neural networks for feature extraction," arXiv:1512.06293, 2015.
[16] R. Giryes, G. Sapiro, and A. M. Bronstein, "Deep neural networks with random Gaussian weights: a universal classification strategy?" IEEE Transactions on Signal Processing, vol. 64, no. 13, pp. 3444–3457, Jul. 2016.
[17] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y.
LeCun, "The loss surfaces of multilayer networks," International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[18] B. D. Haeffele and R. Vidal, "Global optimality in tensor factorization, deep learning, and beyond," 2015.
[19] R. Giryes, Y. C. Eldar, A. M. Bronstein, and G. Sapiro, "Tradeoffs between convergence speed and reconstruction accuracy in inverse problems," 2016.
[20] A. M. Saxe, J. L. McClelland, and S. Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks," International Conference on Learning Representations (ICLR), 2014.
[21] Y. Ollivier, "Riemannian metrics for neural networks I: feedforward networks," Information and Inference, vol. 4, no. 2, pp. 108–153, Jun. 2015.
[22] B. Neyshabur and R. Salakhutdinov, "Path-SGD: path-normalized optimization in deep neural networks," Advances in Neural Information Processing Systems (NIPS), pp. 2422–2430, 2015.
[23] S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift," Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456, 2015.
[24] T. Salimans and D. P. Kingma, "Weight normalization: a simple reparameterization to accelerate training of deep neural networks," Advances in Neural Information Processing Systems (NIPS), pp. 901–909, 2016.
[25] S. An, M. Hayat, S. H. Khan, M. Bennamoun, F. Boussaid, and F. Sohel, "Contractive rectifier networks for nonlinear maximum margin classification," Proceedings of the IEEE International Conference on Computer Vision, pp. 2515–2523, 2015.
[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research (JMLR), vol. 15, no. 1, pp. 1929–1958, Jun. 2014.
[27] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y.
Bengio, "Contractive auto-encoders: explicit invariance during feature extraction," Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 833–840, 2011.
[28] J. Huang, Q. Qiu, G. Sapiro, and R. Calderbank, "Discriminative robust transformation learning," Advances in Neural Information Processing Systems (NIPS), pp. 1333–1341, 2015.
[29] V. N. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, Sep. 1999.
[30] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[31] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: risk bounds and structural results," Journal of Machine Learning Research (JMLR), vol. 3, pp. 463–482, 2002.
[32] H. Xu and S. Mannor, "Robustness and generalization," Machine Learning, vol. 86, no. 3, pp. 391–423, 2012.
[33] B. Neyshabur, R. Tomioka, and N. Srebro, "Norm-based capacity control in neural networks," Proceedings of The 28th Conference on Learning Theory (COLT), pp. 1376–1401, 2015.
[34] S. Sun, W. Chen, L. Wang, and T.-Y. Liu, "Large margin deep neural networks: theory and algorithms," 2015.
[35] S. Zagoruyko and N. Komodakis, "Wide residual networks," 2016.
[36] C. Zhang, S. Bengio, M. Hardt, and B. Recht, "Understanding deep learning requires rethinking generalization," arXiv:1611.03530, 2016.
[37] J. Sokolić, R. Giryes, G. Sapiro, and M. R. D. Rodrigues, "Generalization error of invariant classifiers," International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[38] K. Q. Shen, C. J. Ong, X. P. Li, and E. P. V. Wilder-Smith, "Feature selection via sensitivity analysis of SVM probabilistic outputs," Machine Learning, vol. 70, no. 1, pp. 1–20, Jan. 2008.
[39] J. B. Yang, K. Q. Shen, C. J. Ong, and X. P.
Li, "Feature selection via sensitivity analysis of MLP probabilistic outputs," IEEE International Conference on Systems, Man and Cybernetics, pp. 774–779, 2008.
[40] D. Shi, D. S. Yeung, and J. Gao, "Sensitivity analysis applied to the construction of radial basis function networks," Neural Networks, vol. 18, no. 7, pp. 951–957, Mar. 2005.
[41] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann, "Uniform uncertainty principle for Bernoulli and subgaussian ensembles," Constructive Approximation, vol. 28, no. 3, pp. 277–289, Dec. 2008.
[42] N. Verma, "Distance preserving embeddings for general n-dimensional manifolds," Journal of Machine Learning Research (JMLR), vol. 14, no. 1, pp. 2415–2448, Aug. 2013.
[43] B. Neyshabur, R. Tomioka, R. Salakhutdinov, and N. Srebro, "Data-dependent path normalization in neural networks," International Conference on Learning Representations (ICLR), 2015.
[44] G. Watson, "Characterization of the subdifferential of some matrix norms," Linear Algebra and its Applications, vol. 170, pp. 33–45, Jun. 1992.
[45] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," Technical University of Denmark, 2012.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," 2016.
[47] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[48] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[49] A. Veit, M. Wilber, and S. Belongie, "Residual networks are exponential ensembles of relatively shallow networks," arXiv:1605.06431, 2016.
[50] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[51] A. Krizhevsky and G.
Hinton, "Learning multiple layers of features from tiny images," Computer Science Department, University of Toronto, Tech. Rep., Apr. 2009.
[52] Y. S. Hsiao, J. Sanchez-Riera, T. Lim, K. L. Hua, and W. H. Cheng, "LaRED: a large RGB-D extensible hand gesture dataset," Proceedings of the 5th ACM Multimedia Systems Conference, pp. 53–58, 2014.
[53] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[54] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: the all convolutional net," International Conference on Learning Representations (ICLR, workshop track), Dec. 2015.
[55] Theano Development Team, "Theano: a Python framework for fast computation of mathematical expressions," arXiv:1605.02688, 2016.
