Understanding Convolutional Neural Networks

Jayanth Koushik
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
jkoushik@cs.cmu.edu

Abstract

Convolutional Neural Networks (CNNs) exhibit extraordinary performance on a variety of machine learning tasks, yet their mathematical properties and behavior remain poorly understood. Some work exists, in the form of a framework, for analyzing the operations that they perform. The goal of this project is to present key results from this theory, and to provide intuition for why CNNs work.

1 Introduction

1.1 The supervised learning problem

We begin by formalizing the supervised learning problem which CNNs are designed to solve. We will consider both regression and classification, but restrict the label (dependent variable) to be univariate. Let $X \in \mathcal{X} \subset \mathbb{R}^d$ and $Y \in \mathcal{Y} \subset \mathbb{R}$ be two random variables. We typically have $Y = f(X)$ for some unknown $f$. Given a sample $\{(x_i, y_i)\}_{i=1,\dots,n}$ drawn from the joint distribution of $X$ and $Y$, the goal of supervised learning is to learn a mapping $\hat{f} : \mathcal{X} \to \mathcal{Y}$ which minimizes the expected loss, as defined by a suitable loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. However, minimizing over the set of all functions from $\mathcal{X}$ to $\mathcal{Y}$ is ill-posed, so we restrict the space of hypotheses to some set $\mathcal{F}$, and define

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \mathbb{E}[L(Y, f(X))] \tag{1}$$

1.2 Linearization

A common strategy for learning classifiers, and the one employed by kernel methods, is to linearize the variations in $f$ with a feature representation. A feature representation is any transformation of the input variable $X$; a change of variable. Let this transformation be given by $\Phi(X)$. Note that the transformed variable need not have a lower dimension than $X$. We would like to construct a feature representation such that $f$ is linearly separable in the transformed space, i.e.
$$f(X) = \langle \Phi(X), w \rangle \tag{2}$$

for regression, or

$$f(X) = \operatorname{sign}(\langle \Phi(X), w \rangle) \tag{3}$$

for binary classification.¹ Classification algorithms like Support Vector Machines (SVMs) [3] use a fixed feature representation that may, for instance, be defined by a kernel.

¹ Multi-class classification problems can be considered as multiple binary classification problems.

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1.3 Symmetries

The transformations induced by kernel methods do not always linearize $f$, especially in the case of natural image classification. To find suitable feature transformations for natural images, we must consider their invariance properties. Natural images show a wide range of invariances, e.g. to pose, lighting, and scale. To learn good feature representations, we must suppress these intra-class variations, while at the same time maintaining inter-class variations. This notion is formalized with the concept of symmetries, defined next.

Definition 1 (Global Symmetry). Let $g$ be an operator from $\mathcal{X}$ to $\mathcal{X}$. $g$ is a global symmetry of $f$ if $f(g.x) = f(x)$ for all $x \in \mathcal{X}$.

Definition 2 (Local Symmetry). Let $G$ be a group of operators from $\mathcal{X}$ to $\mathcal{X}$ with norm $|\cdot|$. $G$ is a group of local symmetries of $f$ if for each $x \in \mathcal{X}$, there exists some $C_x > 0$ such that $f(g.x) = f(x)$ for all $g \in G$ such that $|g| < C_x$.

Global symmetries rarely exist in real images, so we instead try to construct features that linearize $f$ along local symmetries. The symmetries we will consider are translations and diffeomorphisms, discussed next.

1.4 Translations and Diffeomorphisms

Given a signal $x$, we can interpolate its dimensions and define $x(u)$ for all $u \in \mathbb{R}^n$ ($n = 2$ for images). A translation is an operator $g$ given by $g.x(u) = x(u - g)$. A diffeomorphism is a deformation; small diffeomorphisms can be written as $g.x(u) = x(u - g(u))$.
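The action of translations and small diffeomorphisms on a sampled 1-D signal can be illustrated with a short numpy sketch; the function names and the particular warp below are illustrative choices, not from the paper:

```python
import numpy as np

def translate(x, g, grid):
    """Global translation g.x(u) = x(u - g), via linear interpolation on a sampled grid."""
    return np.interp(grid - g, grid, x)

def deform(x, g_func, grid):
    """Small diffeomorphism g.x(u) = x(u - g(u)): a position-dependent warp of the signal."""
    return np.interp(grid - g_func(grid), grid, x)

grid = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * 3 * grid)                                  # a toy 1-D signal

shifted = translate(x, 0.1, grid)                                 # translate by g = 0.1
warped = deform(x, lambda u: 0.02 * np.sin(2 * np.pi * u), grid)  # a small deformation
```

A classifier of natural images should be insensitive to both kinds of perturbation when they are small, which is exactly the regularity condition formalized next.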
We seek feature transformations $\Phi$ which linearize the action of local translations and diffeomorphisms. This can be expressed as a Lipschitz continuity condition:

$$\|\Phi(g.x) - \Phi(x)\| \le C |g| \|x\| \tag{4}$$

1.5 Convolutional Neural Networks

Convolutional Neural Networks (CNNs), introduced by LeCun et al. [6], are a class of biologically inspired neural networks which solve equation (1) by passing $X$ through a series of convolutional filters and simple non-linearities. They have shown remarkable results on a wide variety of machine learning problems [8]. Figure 1 shows a typical CNN architecture.

A convolutional neural network has a hierarchical architecture. Starting from the input signal $x$, each subsequent layer $x_j$ is computed as

$$x_j = \rho W_j x_{j-1} \tag{5}$$

Here $W_j$ is a linear operator and $\rho$ is a non-linearity. Typically, in a CNN, $W_j$ is a convolution, and $\rho$ is a rectifier $\max(x, 0)$ or a sigmoid $1/(1 + \exp(-x))$. It is easier to think of the operator $W_j$ as a stack of convolutional filters, so that the layers are filter maps and each layer can be written as a sum of convolutions of the previous layer:

$$x_j(u, k_j) = \rho \Big( \sum_k (x_{j-1}(\cdot, k) \ast W_{j,k_j}(\cdot, k))(u) \Big) \tag{6}$$

Here $\ast$ is the discrete convolution operator:

$$(f \ast g)(x) = \sum_{u=-\infty}^{\infty} f(u) g(x - u) \tag{7}$$

The optimization problem defined by a convolutional neural network is highly non-convex, so typically the weights $W_j$ are learned by stochastic gradient descent, using the backpropagation algorithm to compute gradients.

Figure 1: Architecture of a Convolutional Neural Network (from LeCun et al. [7])

1.6 A mathematical framework for CNNs

Mallat [10] introduced a mathematical framework for analyzing the properties of convolutional networks. The theory is based on extensive prior work on wavelet scattering (see for example [2, 1]) and illustrates that to compute invariants, we must separate variations of $X$ at different scales with a wavelet transform.
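Equations (5)–(7) can be made concrete with a minimal numpy sketch of a single convolutional layer, here in 1-D with a rectifier non-linearity; the array shapes and names are illustrative, not from the paper:

```python
import numpy as np

def relu(x):
    """The rectifier non-linearity rho(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

def conv_layer(x_prev, W):
    """One layer per equation (6): x_j(u, k_j) = rho(sum_k (x_{j-1}(., k) * W_{j,k_j}(., k))(u)).

    x_prev: array of shape (K_in, T), a stack of 1-D channel signals x_{j-1}(., k).
    W:      array of shape (K_out, K_in, F), one length-F filter per (output, input) channel pair.
    Returns an array of shape (K_out, T + F - 1) of new filter maps.
    """
    K_out, K_in, F = W.shape
    T = x_prev.shape[1]
    out = np.empty((K_out, T + F - 1))
    for k_out in range(K_out):
        acc = np.zeros(T + F - 1)
        for k_in in range(K_in):
            # discrete convolution of equation (7), summed over input channels
            acc += np.convolve(x_prev[k_in], W[k_out, k_in])
        out[k_out] = relu(acc)
    return out

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32))    # 3 input channels, length-32 signal
W1 = rng.standard_normal((5, 3, 7))  # 5 output filters of length 7
x1 = conv_layer(x0, W1)              # shape (5, 38)
```

Stacking such layers, with the weights learned by gradient descent, yields the hierarchy of equation (5).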
The theory is a first step towards understanding general classes of CNNs, and this paper presents its key concepts.

2 The need for wavelets

Although the framework based on wavelet transforms is quite successful in analyzing the operations of CNNs, the motivation or need for wavelets is not immediately obvious. So we will first consider the more general problem of signal processing, and study the need for wavelet transforms. In what follows, we will consider a function $f(t)$ where $t \in \mathbb{R}$ can be thought of as representing time, which makes $f$ a time-varying function like an audio signal. The concepts, however, extend quite naturally to images as well, when we change $t$ to a two-dimensional vector. Given such a signal, we are often interested in studying its variations across time. With the image metaphor, this corresponds to studying the variations in different parts of the image. We will consider a progression of tools for analyzing such variations. Most of the following material is from the book by Gerald [5].

2.1 Fourier transform

The Fourier transform of $f$ is defined as

$$\hat{f}(\omega) \equiv \int_{-\infty}^{\infty} f(t) e^{-2\pi i \omega t} \, dt \tag{8}$$

The Fourier transform is a powerful tool which decomposes $f$ into the frequencies that make it up. However, it should be quite clear from equation (8) that it is useless for the task we are interested in: since the integral runs from $-\infty$ to $\infty$, $\hat{f}$ is an average over all time and carries no local information.

2.2 Windowed Fourier transform

To avoid the loss of information that comes from integrating over all time, we might use a weight function that localizes $f$ in time. Without going into specifics, let us consider some function $g$ supported on $[-T, 0]$ and define the windowed Fourier transform (WFT) as

$$\tilde{f}(\omega, t) \equiv \int_{-\infty}^{\infty} f(u) g(u - t) e^{-2\pi i \omega u} \, du \tag{9}$$

It should be intuitively clear that the WFT can capture local variations in a time window of width $T$.
Further, it can be shown that the WFT also provides accurate information about $f$ in a frequency band of some width $\Omega$. So does the WFT solve our problem? Unfortunately not; this is a consequence of Theorem 1, stated very informally next.

Theorem 1 (Uncertainty Principle).² Let $f$ be a function which is small outside a time-interval of length $T$, and let its Fourier transform be small outside a frequency-band of width $\Omega$. There exists a positive constant $c$ such that $\Omega T \ge c$.

Because of the Uncertainty Principle, $T$ and $\Omega$ cannot both be small. Roughly speaking, this implies that the WFT cannot capture small variations in a small time window (or, in the case of images, a small patch).

2.3 Continuous wavelet transform

The WFT fails because it introduces a scale (the width of the window) into the analysis. The continuous wavelet transform involves scale too, but it considers all possible scalings and so avoids the problem faced by the WFT. Again, we begin with a window function $\psi$ (supported on $[-T, 0]$), this time called a mother wavelet. For some fixed $p \ge 0$, we define

$$\psi_s(u) \equiv |s|^{-p} \psi\!\left(\frac{u}{s}\right) \tag{10}$$

The scale $s$ is allowed to be any non-zero real number. With this family of wavelets, we define the continuous wavelet transform (CWT) as

$$\tilde{f}(s, t) \equiv (f \ast \psi_s)(t) \tag{11}$$

where $\ast$ is the continuous convolution operator:

$$(p \ast q)(x) \equiv \int_{-\infty}^{\infty} p(u) q(x - u) \, du \tag{12}$$

The continuous wavelet transform captures variations in $f$ at a particular scale. It provides the foundation for the operation of CNNs, as will be explored next.

3 Scale separation with wavelets

Having motivated the need for a wavelet transform, we will now construct a feature representation using the wavelet transform. Note that convolutional neural networks are covariant to translations because they use convolutions for their linear operators. So we will focus on transformations that linearize diffeomorphisms.
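The localization behavior of equations (9)–(11) can be checked numerically. The sketch below discretizes both transforms on a sampled grid; the Gaussian window for $g$ and the Mexican-hat mother wavelet for $\psi$ are illustrative choices, not prescribed by the paper:

```python
import numpy as np

# A signal whose frequency content changes over time: 10 Hz on [0, 0.5), 40 Hz on [0.5, 1].
t = np.linspace(0.0, 1.0, 512)
x = np.where(t < 0.5, np.sin(2 * np.pi * 10 * t), np.sin(2 * np.pi * 40 * t))

def wft(x, t, omega, tau, T=0.1):
    """Windowed Fourier transform of equation (9), with a Gaussian window of
    width ~T centered at tau (the Gaussian is an illustrative choice of g)."""
    g = np.exp(-((t - tau) ** 2) / (2 * T ** 2))
    dt = t[1] - t[0]
    return np.sum(x * g * np.exp(-2j * np.pi * omega * t)) * dt

def cwt(x, t, s, p=0.5):
    """Continuous wavelet transform of equation (11) at scale s, with
    psi_s(u) = |s|^{-p} psi(u/s) as in equation (10); a Mexican-hat
    mother wavelet is used here as an illustrative choice."""
    dt = t[1] - t[0]
    u = (t - t.mean()) / s
    psi_s = np.abs(s) ** (-p) * (1 - u ** 2) * np.exp(-u ** 2 / 2)
    return np.convolve(x, psi_s, mode="same") * dt
```

For this signal, the WFT magnitude at $\omega = 10$ dominates near $\tau = 0.25$ while the magnitude at $\omega = 40$ dominates near $\tau = 0.75$, illustrating the time-frequency localization that the plain Fourier transform of equation (8) cannot provide.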
Theorem 2. Let $\phi_J(u) = 2^{-nJ}\phi(2^{-J}u)$ be an averaging kernel with $\int \phi(u)\,du = 1$. Here $n$ is the dimension of the index in $X$; for example, $n = 2$ for images. Let $\{\psi_k\}_{k=1}^{K}$ be a set of $K$ wavelets with zero average, $\int \psi_k(u)\,du = 0$, and from them define $\psi_{j,k}(u) \equiv 2^{-jn}\psi_k(2^{-j}u)$. Let $\Phi_J$ be the feature transformation defined as

$$\Phi_J x(u, j, k) = (|x \ast \psi_{j,k}| \ast \phi_J)(u)$$

Then $\Phi_J$ is locally invariant to translations at scale $2^J$, and Lipschitz continuous to the action of diffeomorphisms as defined by equation (4), under the following diffeomorphism norm:

$$|g| = 2^{-J} \sup_{u \in \mathbb{R}^n} |g(u)| + \sup_{u \in \mathbb{R}^n} |\nabla g(u)| \tag{13}$$

Theorem 2 shows that $\Phi_J$ satisfies the regularity conditions which we seek. However, it leads to a loss of information due to the averaging with $\phi_J$. The lost information is recovered by a hierarchy of wavelet decompositions, as discussed next.

² Contrary to popular belief, the Uncertainty Principle is a mathematical, not physical, property.

Figure 2: Architecture of the scattering transform (from Estrach [4])

4 Scattering Transform

Convolutional Neural Networks transform their input with a series of linear operators and point-wise non-linearities. To study their properties, we first consider a simpler feature transformation, the scattering transform introduced by Mallat [9]. As discussed in section 1.5, CNNs compute multiple convolutions across channels in each layer; as a simplification, we consider the transformation obtained by convolving a single channel:

$$x_j(u, k_j) = \rho(x_{j-1}(\cdot, k_{j-1}) \ast W_{j,h})(u) \tag{14}$$

Here $k_j = (k_{j-1}, h)$ and $h$ controls the hierarchical structure of the transformation. Specifically, we can recursively expand the above equation to write

$$x_J(u, k_J) = \rho(\rho(\dots \rho(x \ast W_{1,h_1}) \ast \dots) \ast W_{J,h_J}) \tag{15}$$

This produces a hierarchical transformation with a tree structure rather than a full network.
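The transformation of Theorem 2, a wavelet modulus followed by averaging, can be sketched in 1-D with numpy. The particular filters below (a mean-subtracted windowed cosine for the wavelet and a uniform averaging window for $\phi_J$) are crude illustrative stand-ins for a proper wavelet family:

```python
import numpy as np

def phi_J_transform(x, psis, phi):
    """Phi_J x = |x * psi_{j,k}| * phi_J: wavelet modulus followed by averaging (Theorem 2)."""
    feats = []
    for psi in psis:
        detail = np.abs(np.convolve(x, psi, mode="same"))    # |x * psi_{j,k}|, the modulus
        feats.append(np.convolve(detail, phi, mode="same"))  # averaged by phi_J
    return np.stack(feats)

n = np.arange(256)
x = np.sin(2 * np.pi * 5 * n / 256)                  # a pure oscillation

k = np.arange(-16, 17)
psi = np.cos(2 * np.pi * 5 * k / 256) * np.exp(-(k / 12.0) ** 2)
psi = psi - psi.mean()                               # make the filter zero-average, as required
phi = np.ones(64) / 64                               # a wide averaging window

feats = phi_J_transform(x, [psi], phi)
feats_shifted = phi_J_transform(np.roll(x, 4), [psi], phi)
```

Because of the averaging, the features change far less under a small circular shift of the input than the raw signal does, which is the local translation invariance the theorem describes.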
It is possible to show that the above transformation has an equivalent representation through wavelet filters, i.e. there exists a sequence $p \equiv (\lambda_1, \dots, \lambda_m)$ such that

$$x_J(u, k_J) = S_J[p]x(u) \equiv (U[p]x \ast \phi_J)(u) \equiv (\rho(\rho(\dots\rho(x \ast \psi_{\lambda_1}) \ast \dots) \ast \psi_{\lambda_m}) \ast \phi_J)(u) \tag{16}$$

where the $\psi_{\lambda_i}$ are suitably chosen wavelet filters and $\phi_J$ is the averaging filter defined in Theorem 2. This is the wavelet scattering transform; its structure is similar to that of a convolutional neural network, as shown in figure 2, but its filters are defined by fixed wavelet functions instead of being learned from the data. Further, we have the following theorem about the scattering transform.

Theorem 3. Let $S_J[p]$ be the scattering transform as defined by equation (16). Then there exists $C > 0$ such that for all diffeomorphisms $g$ and all $L^2(\mathbb{R}^n)$ signals $x$,

$$\|S_J[p]g.x - S_J[p]x\| \le C m |g| \|x\| \tag{17}$$

with the diffeomorphism norm $|g|$ given by equation (13).

Theorem 3 shows that the scattering transform is Lipschitz continuous to the action of diffeomorphisms, so the action of small deformations is linearized over the scattering coefficients. Further, because of its structure, it is naturally locally invariant to translations. It has several other desirable properties [4], and can be used to achieve state-of-the-art classification errors on the MNIST digits dataset [2].

5 General Convolutional Neural Network Architectures

The scattering transform described in the previous section provides a simplified view of a general convolutional neural network. While it provides intuition about the workings of CNNs, the transformation suffers from high variance and loss of information because we only consider single-channel convolutions. To analyze the properties of general CNN architectures, we must allow for channel combinations.
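The tree structure of equation (16) can be sketched by enumerating paths $p = (\lambda_1, \dots, \lambda_m)$ recursively, using the complex modulus for the non-linearity $\rho$ as in the scattering literature. The toy filters below are illustrative, not a proper wavelet family:

```python
import numpy as np

def scattering_paths(x, psis, phi, depth):
    """Scattering coefficients S_J[p]x = (U[p]x * phi_J)(u) for every path p up to
    a given length, where U[p]x = |...||x * psi_{l1}| * psi_{l2}|...| iterates a
    wavelet convolution followed by the modulus (equation (16))."""
    coeffs = {(): np.convolve(x, phi, mode="same")}  # order 0: x * phi_J
    frontier = {(): x}
    for _ in range(depth):
        new_frontier = {}
        for path, u in frontier.items():
            for i, psi in enumerate(psis):
                # extend the path: apply one more wavelet filter and the modulus
                new_frontier[path + (i,)] = np.abs(np.convolve(u, psi, mode="same"))
        for path, u in new_frontier.items():
            coeffs[path] = np.convolve(u, phi, mode="same")  # every node is averaged by phi_J
        frontier = new_frontier
    return coeffs

x = np.sin(2 * np.pi * np.arange(128) / 16)
psis = [np.array([1.0, -1.0]), np.array([1.0, 0.0, -1.0])]  # two crude zero-average filters
phi = np.ones(8) / 8                                        # averaging filter
S = scattering_paths(x, psis, phi, depth=2)  # paths: (), (0,), (1,), (0,0), ..., (1,1)
```

With two filters and paths of length up to two, this produces $1 + 2 + 4 = 7$ coefficient maps, mirroring the tree in figure 2.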
Mallat [10] extends the previously introduced tools to develop a mathematical framework for this analysis. The theory is, however, out of the scope of this paper. At a high level, the extension is achieved by replacing the requirement of contractions and invariants to translations with contractions along adaptive groups of local symmetries. Further, the wavelets are replaced by adapted filter weights similar to those of deep learning models.

6 Conclusion

In this paper, we analyzed the properties of convolutional neural networks. A simplified model, the scattering transform, was introduced as a first step towards understanding CNN operations. We saw that the feature transformation is built on wavelet transforms, which separate variations at different scales. The analysis of general CNN architectures was not considered in this paper, but even that analysis is only a first step towards a full mathematical understanding of convolutional neural networks.

References

[1] Joakim Andén and Stéphane Mallat. Deep scattering spectrum. IEEE Transactions on Signal Processing, 62(16):4114–4128, 2014.
[2] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
[3] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[4] Joan Bruna Estrach. Scattering representations for recognition.
[5] Kaiser Gerald. A Friendly Guide to Wavelets, 1994.
[6] B. Boser Le Cun, John S. Denker, D. Henderson, Richard E. Howard, W. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, 1990.
[7] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[9] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
[10] Stéphane Mallat. Understanding deep convolutional networks. arXiv preprint arXiv:1601.04920, 2016.