On the Expressive Power of Deep Learning: A Tensor Analysis


Authors: Nadav Cohen, Or Sharir, Amnon Shashua

Nadav Cohen (cohennadav@cs.huji.ac.il)
Or Sharir (or.sharir@cs.huji.ac.il)
Amnon Shashua (shashua@cs.huji.ac.il)
The Hebrew University of Jerusalem

Abstract

It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.

Keywords: Deep Learning, Expressive Power, Arithmetic Circuits, Tensor Decompositions

1. Introduction

The expressive power of neural networks is achieved through depth. There is mounting empirical evidence that for a given budget of resources (e.g.
neurons), the deeper one goes, the better the eventual performance will be. However, existing theoretical arguments that support this empirical finding are limited. There have been many attempts to theoretically analyze function spaces generated by network architectures, and their dependency on network depth and size. The prominent approach for justifying the power of depth is to show that deep networks can efficiently express functions that would require shallow networks to have super-polynomial size. We refer to such scenarios as instances of depth efficiency. Unfortunately, existing results dealing with depth efficiency (e.g. Hastad (1986); Håstad and Goldmann (1991); Delalleau and Bengio (2011); Martens and Medabalimi (2014)) typically apply to specific network architectures that do not resemble ones commonly used in practice. In particular, none of these results apply to convolutional networks (LeCun and Bengio (1995)), which represent the most empirically successful and widely used deep learning architecture to date. A further limitation of current results is that they merely show existence of depth efficiency (i.e. of functions that are efficiently realizable with a certain depth but cannot be efficiently realized with shallower depths), without providing any information as to how frequent this property is. These shortcomings of current theory are the ones that motivated our work.

© N. Cohen, O. Sharir & A. Shashua.

The architectural features that specialize convolutional networks compared to classic feed-forward fully-connected networks are threefold. The first feature, locality, refers to the connection of a neuron only to neighboring neurons in the preceding layer, as opposed to having the entire layer drive it.
In the context of image processing (the most common application of convolutional networks), locality is believed to reflect the inherent compositional structure of data – the closer pixels are in an image, the more likely they are to be correlated. The second architectural feature of convolutional networks is sharing, which means that different neurons in the same layer, connected to different neighborhoods in the preceding layer, share the same weights. Sharing, which together with locality gives rise to convolution, is motivated by the fact that in natural images, the semantic meaning of a pattern often does not depend on its location (i.e. two identical patterns appearing in different locations of an image often convey the same semantic content). Finally, the third architectural idea of convolutional networks is pooling, which is essentially an operator that decimates layers, replacing neural activations in a spatial window by a single value (e.g. their maximum or average). In the context of images, pooling induces invariance to translations (which often do not affect semantic content), and in addition is believed to create a hierarchy of abstraction in the patterns neurons respond to. The three architectural elements of locality, sharing and pooling, which have facilitated the great success of convolutional networks, are all lacking in existing theoretical studies of depth efficiency.

In this paper we introduce a convolutional arithmetic circuit architecture that incorporates locality, sharing and pooling. Arithmetic circuits (also known as Sum-Product Networks, Poon and Domingos (2011)) are networks with two types of nodes: sum nodes, which compute a weighted sum of their inputs, and product nodes, computing the product of their inputs. We use sum nodes to implement convolutions (locality with sharing), and product nodes to realize pooling.
The models we arrive at may be viewed as convolutional networks with product pooling and linear point-wise activation. They are attractive on three accounts. First, as discussed in app. E, convolutional arithmetic circuits are equivalent to SimNets, a new deep learning architecture that has recently demonstrated promising empirical results on various image recognition benchmarks (Cohen et al. (2016)). Second, as we show in sec. 3, convolutional arithmetic circuits are realizations of hierarchical tensor decompositions (see Hackbusch (2012)), opening the door to various mathematical and algorithmic tools for their analysis and implementation. Third, the depth efficiency of convolutional arithmetic circuits, which we analyze in sec. 4, was shown in the subsequent work of Cohen and Shashua (2016) to be superior to the depth efficiency of the popular convolutional rectifier networks, namely convolutional networks with rectified linear (ReLU) activation and max or average pooling.

Employing machinery from measure theory and matrix algebra, made available through their connection to hierarchical tensor decompositions, we prove a number of fundamental results concerning the depth efficiency of our convolutional arithmetic circuits. Our main theoretical result (thm. 1 and corollary 2) states that besides a negligible (zero measure) set, all functions that can be realized by a deep network of polynomial size require exponential size in order to be realized, or even approximated, by a shallow network. When translated to the viewpoint of tensor decompositions, this implies that almost all tensors realized by Hierarchical Tucker (HT) decomposition (Hackbusch and Kühn (2009)) cannot be efficiently realized by the classic CP (rank-1) decomposition.
To the best of our knowledge, this result is new to the tensor analysis community, in which the advantage of HT over CP is typically demonstrated through specific examples of tensors that can be efficiently realized by the former and not by the latter. Following our main result, we present a generalization (thm. 3 and corollary 4) that compares networks of arbitrary depths, showing that the amount of resources one has to pay in order to maintain representational power while trimming down layers of a network grows double exponentially w.r.t. the number of layers cut off. We also characterize cases in which dropping a single layer bears an exponential price.

The remainder of the paper is organized as follows. In sec. 2 we briefly review notations and mathematical background required in order to follow our work. This is followed by sec. 3, which presents our convolutional arithmetic circuits and establishes their equivalence with tensor decompositions. Our theoretical analysis is covered in sec. 4. Finally, sec. 5 concludes. In order to keep the manuscript at a reasonable length, we defer our detailed survey of related work to app. D, covering works on the depth efficiency of boolean circuits, arithmetic circuits and neural networks, as well as different applications of tensor analysis in the field of deep learning.

2. Preliminaries

We begin by establishing notational conventions that will be used throughout the paper. We denote vectors using bold typeface, e.g. v ∈ R^s. The coordinates of such a vector are referenced with regular typeface and a subscript, e.g. v_i ∈ R. This is not to be confused with bold typeface and a subscript, e.g. v_i ∈ R^s, which represents a vector that belongs to some sequence. Tensors (multi-dimensional arrays) are denoted by the letters "A" and "B" in calligraphic typeface, e.g.
A, B ∈ R^{M_1 × ··· × M_N}. A specific entry in a tensor will be referenced with subscripts, e.g. A_{d_1...d_N} ∈ R. Superscripts will be used to denote individual objects within a collection. For example, v^{(i)} stands for vector i and A^y stands for tensor y. In cases where the collection of interest is indexed by multiple coordinates, we will have multiple superscripts referencing individual objects, e.g. a^{l,j,γ} will stand for vector (l, j, γ). As shorthand for the Cartesian product of the Euclidean space R^s with itself N times, we will use the notation (R^s)^N. Finally, for a positive integer k we use the shorthand [k] to denote the set {1, . . . , k}.

We now turn to establish a baseline, i.e. to present basic definitions and results, in the broad and comprehensive field of tensor analysis. We list here only the essentials required in order to follow the paper, referring the interested reader to Hackbusch (2012) for a more complete introduction to the field [1]. The most straightforward way to view a tensor is simply as a multi-dimensional array: A_{d_1,...,d_N} ∈ R, where i ∈ [N] and d_i ∈ [M_i]. The number of indexing entries in the array, which are also called modes, is referred to as the order of the tensor. The term dimension stands for the number of values an index can take in a particular mode. For example, the tensor A appearing above has order N and dimension M_i in mode i, i ∈ [N]. The space of all possible configurations A can take is called a tensor space and is denoted, quite naturally, by R^{M_1 × ··· × M_N}.

A central operator in tensor analysis is the tensor product, denoted ⊗. This operator intakes two tensors A and B of orders P and Q respectively, and returns a tensor A ⊗ B of order P + Q, defined by: (A ⊗ B)_{d_1...d_{P+Q}} = A_{d_1...d_P} · B_{d_{P+1}...d_{P+Q}}. Notice that in the case P = Q = 1, the tensor product reduces to an outer product between vectors.
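As a concrete illustration (a NumPy sketch of ours, not part of the paper; the function name `tensor_product` is our own), the defining identity for ⊗ can be checked directly:

```python
import numpy as np

def tensor_product(A, B):
    """Tensor product of two tensors: orders P and Q give order P + Q.

    (A ⊗ B)_{d1..dP, dP+1..dP+Q} = A_{d1..dP} * B_{dP+1..dP+Q}
    """
    return np.multiply.outer(A, B)

# An order-2 tensor times an order-1 tensor yields an order-3 tensor.
A = np.arange(6.0).reshape(2, 3)   # order 2, dimensions (2, 3)
B = np.array([1.0, -1.0])          # order 1, dimension 2
T = tensor_product(A, B)
assert T.shape == (2, 3, 2)        # order P + Q = 3
# The entry-wise definition holds:
assert T[1, 2, 0] == A[1, 2] * B[0]
```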
Specifically, u ⊗ v – the tensor product between u ∈ R^{M_1} and v ∈ R^{M_2} – is no other than the rank-1 matrix uv^⊤ ∈ R^{M_1 × M_2}. In this context, we will often use the shorthand ⊗_{i=1}^N v^{(i)} to denote the joint tensor product v^{(1)} ⊗ ··· ⊗ v^{(N)}. Tensors of the form ⊗_{i=1}^N v^{(i)} are called pure or elementary, and are regarded as having rank 1 (assuming v^{(i)} ≠ 0 ∀i). It is not difficult to see that any tensor can be expressed as a sum of rank-1 tensors:

    A = Σ_{z=1}^Z v_z^{(1)} ⊗ ··· ⊗ v_z^{(N)},   v_z^{(i)} ∈ R^{M_i}    (1)

A representation as above is called a CANDECOMP/PARAFAC decomposition of A, or in short, a CP decomposition [2]. The CP-rank of A is defined as the minimum number of terms in a CP decomposition, i.e. as the minimal Z for which eq. 1 can hold. Notice that for a tensor of order 2, i.e. a matrix, this definition of CP-rank coincides with that of standard matrix rank.

[1] The definitions we give are concrete special cases of the more abstract algebraic definitions given in Hackbusch (2012). We limit the discussion to these special cases since they suffice for our needs and are easier to grasp.

A symmetric tensor is one that is invariant to permutations of its indices. Formally, a tensor A of order N which is symmetric will have equal dimension M in all modes, and for every permutation π : [N] → [N] and indices d_1...d_N ∈ [M], the following equality will hold: A_{d_π(1)...d_π(N)} = A_{d_1...d_N}. Note that for a vector v ∈ R^M, the tensor ⊗_{i=1}^N v ∈ R^{M×···×M} is symmetric. Moreover, every symmetric tensor may be expressed as a linear combination of such (symmetric rank-1) tensors: A = Σ_{z=1}^Z λ_z · v_z ⊗ ··· ⊗ v_z. This is referred to as a symmetric CP decomposition, and the symmetric CP-rank is the minimal Z for which such a decomposition exists.
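Eq. 1 is easy to instantiate numerically. The following sketch (ours; toy sizes and random vectors, chosen only for illustration) builds an order-3 tensor from a Z-term CP decomposition and checks the order-2 remark that CP-rank coincides with matrix rank:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an order-3 tensor from a CP decomposition with Z = 2 terms (eq. 1):
# A = sum_z  v_z^(1) ⊗ v_z^(2) ⊗ v_z^(3)
Z, M = 2, 4
vs = [rng.standard_normal((Z, M)) for _ in range(3)]  # v_z^(i), i = 1..3
A = sum(np.multiply.outer(np.multiply.outer(vs[0][z], vs[1][z]), vs[2][z])
        for z in range(Z))
assert A.shape == (M, M, M)

# For order-2 tensors the CP-rank is the ordinary matrix rank:
# a sum of Z = 2 random rank-1 matrices generically has rank 2.
Mtx = sum(np.outer(vs[0][z], vs[1][z]) for z in range(Z))
assert np.linalg.matrix_rank(Mtx) == Z
```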
Since a symmetric CP decomposition is in particular a standard CP decomposition, the symmetric CP-rank of a symmetric tensor is always greater than or equal to its standard CP-rank. Note that for the case of symmetric matrices (order-2 tensors) the symmetric CP-rank and the original CP-rank are always equal.

A repeating concept in this paper is that of measure zero. More broadly, our analysis is framed in measure theoretical terms. While an introduction to the field is beyond the scope of the paper (the interested reader is referred to Jones (2001)), it is possible to intuitively grasp the ideas that form the basis of our claims. When dealing with subsets of a Euclidean space, the standard and, in a sense, most natural measure is the Lebesgue measure. This is the only measure we consider in our analysis. A set of (Lebesgue) measure zero can be thought of as having zero "volume" in the space of interest. For example, the interval between (0, 0) and (1, 0) has zero measure as a subset of the 2D plane, but has positive measure as a subset of the 1D x-axis. An alternative way to view a zero measure set S follows from the property that if one draws a random point in space by some continuous distribution, the probability of that point hitting S is necessarily zero. A related term that will be used throughout the paper is almost everywhere, which refers to an entire space excluding, at most, a set of zero measure.

3. Convolutional Arithmetic Circuits

We consider the task of classifying an instance X = (x_1, . . . , x_N), x_i ∈ R^s, into one of the categories Y := {1, . . . , Y}. Representing instances as collections of vectors is natural in many applications. In the case of image processing for example, X may correspond to an image, and x_1 . . . x_N may correspond to vector arrangements of (possibly overlapping) patches around pixels.
As customary, classification is carried out through maximization of per-label score functions {h_y}_{y∈Y}, i.e. the predicted label for the instance X will be the index y ∈ Y for which the score value h_y(X) is maximal. Our attention is thus directed to functions over the instance space X := {(x_1, . . . , x_N) : x_i ∈ R^s} = (R^s)^N. We define our hypotheses space through the following representation of score functions:

    h_y(x_1, . . . , x_N) = Σ_{d_1...d_N=1}^M A^y_{d_1,...,d_N} Π_{i=1}^N f_{θ_{d_i}}(x_i)    (2)

[2] CP decomposition is regarded as the classic and most basic tensor decomposition, dating back to the beginning of the 20th century (see Kolda and Bader (2009) for a historic survey).

Figure 1: CP model – convolutional arithmetic circuit implementing CP (rank-1) decomposition.

f_{θ_1} . . . f_{θ_M} : R^s → R are referred to as representation functions, selected from a parametric family F = {f_θ : R^s → R}_{θ∈Θ}. Natural choices for this family are wavelets, radial basis functions (Gaussians), and affine functions followed by point-wise activation (neurons). The coefficient tensor A^y has order N and dimension M in each mode. Its entries correspond to a basis of M^N point-wise product functions {(x_1, . . . , x_N) ↦ Π_{i=1}^N f_{θ_{d_i}}(x_i)}_{d_1...d_N ∈ [M]}. We will often consider fixed linearly independent representation functions f_{θ_1} . . . f_{θ_M}. In this case the point-wise product functions are linearly independent as well (see app.
C.1), and we have a one-to-one correspondence between score functions and coefficient tensors. To keep the manuscript concise, we defer the derivation of our hypotheses space (eq. 2) to app. C, noting here that it arises naturally from the notion of tensor products between L2 spaces.

Our eventual aim is to realize score functions h_y with a layered network architecture. As a first step along this path, we notice that h_y(x_1, . . . , x_N) is fully determined by the activations of the M representation functions f_{θ_1} . . . f_{θ_M} on the N input vectors x_1 . . . x_N. In other words, given {f_{θ_d}(x_i)}_{d∈[M],i∈[N]}, the score h_y(x_1, . . . , x_N) does not depend on the input in any other way. It is thus natural to consider the computation of these M · N numbers as the first layer of our networks. This layer, referred to as the representation layer, may be conceived as a convolutional operator with M channels, each corresponding to a different function applied to all input vectors (see fig. 1).

Once we have constrained our score functions to have the structure depicted in eq. 2, learning a classifier reduces to estimation of the parameters θ_1 . . . θ_M and the coefficient tensors A^1 . . . A^Y. The computational challenge is that the latter tensors are of order N (and dimension M in each mode), having an exponential number of entries (M^N each). In the next subsections we utilize tensor decompositions (factorizations) to address this computational challenge, and show how they are naturally realized by convolutional arithmetic circuits.

3.1. Shallow Network as a CP Decomposition of A^y

The most straightforward way to factorize a tensor is through a CP (rank-1) decomposition (see sec. 2).
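To see why factorization is needed, the unfactorized computation of eq. 2 can be sketched as follows (an illustrative NumPy sketch of ours, not from the paper; the Gaussian "templates" standing in for the parametric family F are a hypothetical choice), making the M^N cost explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, s = 3, 4, 5                      # toy sizes, for illustration only

# Representation layer: f_{θ_d}(x_i) for all i, d. Gaussian RBFs around
# random templates t_d serve as a stand-in for the parametric family F.
X = rng.standard_normal((N, s))        # instance: N vectors in R^s
T = rng.standard_normal((M, s))        # hypothetical representation parameters
F = np.exp(-np.square(X[:, None, :] - T[None, :, :]).sum(-1))  # F[i, d]

# Eq. 2: contract an order-N coefficient tensor A^y (M^N entries) against
# the rank-1 tensor F[0] ⊗ ... ⊗ F[N-1].
Ay = rng.standard_normal((M,) * N)
score = np.einsum('abc,a,b,c->', Ay, F[0], F[1], F[2])

# Same value, written as the explicit M^N-term sum of eq. 2.
brute = sum(Ay[d] * np.prod([F[i, d[i]] for i in range(N)])
            for d in np.ndindex(*(M,) * N))
assert np.isclose(score, brute)
```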
Consider a joint CP decomposition for the coefficient tensors {A^y}_{y∈Y}:

    A^y = Σ_{z=1}^Z a^y_z · a^{z,1} ⊗ ··· ⊗ a^{z,N}    (3)

where a^y ∈ R^Z for y ∈ Y (a^y_z stands for entry z of a^y), and a^{z,i} ∈ R^M for i ∈ [N], z ∈ [Z]. The decomposition is joint in the sense that the same vectors a^{z,i} are shared across all classes y. Clearly, if we set Z = M^N this model is universal, i.e. any tensors A^1 . . . A^Y may be represented. Substituting our CP decomposition (eq. 3) into the expression for the score functions in eq. 2, we obtain:

    h_y(X) = Σ_{z=1}^Z a^y_z Π_{i=1}^N ( Σ_{d=1}^M a^{z,i}_d f_{θ_d}(x_i) )

From this we conclude that the network illustrated in fig. 1 implements a classifier (score functions) under the CP decomposition in eq. 3. We refer to this network as CP model. The network consists of a representation layer followed by a single hidden layer, which in turn is followed by the output. The hidden layer begins with a 1×1 conv operator, which is simply a 3D convolution with Z channels and receptive field 1×1. The convolution may operate without coefficient sharing, i.e. the filters that generate feature maps by sliding across the previous layer may have different coefficients at different spatial locations. This is often referred to in the deep learning community as a locally-connected operator (see Taigman et al. (2014)). To obtain a standard convolutional operator, simply enforce coefficient sharing by constraining the vectors a^{z,i} in the CP decomposition (eq. 3) to be equal to each other for different values of i (this setting is discussed in sec. 3.3). Following the conv operator, the hidden layer includes global product pooling: feature maps generated by conv are reduced to singletons through multiplication of their entries, creating a vector of dimension Z. This vector is then mapped into the Y network outputs through a final dense linear layer. To recap, CP model (fig.
1) is a shallow (single hidden layer) convolutional arithmetic circuit that realizes the CP decomposition (eq. 3). It is universal, i.e. it can realize any coefficient tensors given a large enough size (Z). Unfortunately, since the CP-rank of a generic tensor is exponential in its order (see Hackbusch (2012)), the size required for CP model to be universal is exponential (Z exponential in N).

3.2. Deep Network as a Hierarchical Decomposition of A^y

In this subsection we present a deep network that corresponds to the recently introduced Hierarchical Tucker tensor decomposition (Hackbusch and Kühn (2009)), which we refer to in short as HT decomposition. The network, dubbed HT model, is universal. Moreover, any set of tensors A^y represented by CP model can be represented by HT model with only a polynomial penalty in terms of resources. The advantage of HT model, as we show in sec. 4, is that in almost all cases it generates tensors that require an exponential size in order to be realized, or even approximated, by CP model. Put differently, if one draws the weights of HT model by some continuous distribution, then with probability one, the resulting tensors cannot be approximated by a polynomially sized CP model. Informally, this implies that HT model is exponentially more expressive than CP model.
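Before presenting the deep model, the shallow CP model's forward pass (1×1 conv, global product pooling, dense output) can be sketched in NumPy (our sketch; sizes and variable names are illustrative) and cross-checked against eq. 2:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, Z, Y = 3, 4, 2, 2                # toy sizes, for illustration only

F  = rng.standard_normal((N, M))       # representation outputs f_{θ_d}(x_i)
az = rng.standard_normal((Z, N, M))    # the vectors a^{z,i} of eq. 3
ay = rng.standard_normal((Y, Z))       # per-class weights a^y

# CP model forward pass: 1x1 conv (inner products), product pooling, dense.
conv = np.einsum('zim,im->iz', az, F)  # conv[i, z] = <a^{z,i}, rep(x_i)>
pool = conv.prod(axis=0)               # global product pooling over locations
scores = ay @ pool                     # h_y(X) for every class y

# Cross-check against eq. 2 with A^y assembled explicitly from eq. 3.
y = 0
Ay = sum(ay[y, z] *
         np.multiply.outer(np.multiply.outer(az[z, 0], az[z, 1]), az[z, 2])
         for z in range(Z))
assert np.isclose(scores[y], np.einsum('abc,a,b,c->', Ay, F[0], F[1], F[2]))
```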
Figure 2: HT model – convolutional arithmetic circuit implementing hierarchical decomposition.

HT model is based on the hierarchical tensor decomposition in eq. 4, which is a special case of the HT decomposition as presented in Hackbusch and Kühn (2009) (in the latter's terminology, we restrict the matrices A^{l,j,γ} to be diagonal). Our construction and theoretical results apply to the general HT decomposition as well, with the specialization done merely to bring forth a network that resembles current convolutional networks [3].

    φ^{1,j,γ} = Σ_{α=1}^{r_0} a^{1,j,γ}_α a^{0,2j−1,α} ⊗ a^{0,2j,α}
    ···
    φ^{l,j,γ} = Σ_{α=1}^{r_{l−1}} a^{l,j,γ}_α φ^{l−1,2j−1,α} ⊗ φ^{l−1,2j,α}          (factors of order 2^{l−1})
    ···
    φ^{L−1,j,γ} = Σ_{α=1}^{r_{L−2}} a^{L−1,j,γ}_α φ^{L−2,2j−1,α} ⊗ φ^{L−2,2j,α}      (factors of order N/4)
    A^y = Σ_{α=1}^{r_{L−1}} a^{L,y}_α φ^{L−1,1,α} ⊗ φ^{L−1,2,α}                      (factors of order N/2)    (4)

The decomposition in eq. 4 recursively constructs the coefficient tensors {A^y}_{y∈[Y]} by assembling vectors {a^{0,j,γ}}_{j∈[N],γ∈[r_0]} into tensors {φ^{l,j,γ}}_{l∈[L−1],j∈[N/2^l],γ∈[r_l]} in an incremental fashion. The index l stands for the level in the decomposition, j represents the "location" within level l, and γ corresponds to the individual tensor in level l and location j.
r_l is referred to as the level-l rank, and is defined to be the number of tensors in each location of level l (for completeness we denote r_L := Y). The tensor φ^{l,j,γ} has order 2^l, and we assume for simplicity that N – the order of A^y – is a power of 2 (this is merely a technical assumption also made in Hackbusch and Kühn (2009); it does not limit the generality of our analysis). The parameters of the decomposition are the final-level weights {a^{L,y} ∈ R^{r_{L−1}}}_{y∈[Y]}, the intermediate levels' weights {a^{l,j,γ} ∈ R^{r_{l−1}}}_{l∈[L−1],j∈[N/2^l],γ∈[r_l]}, and the first-level vectors {a^{0,j,γ} ∈ R^M}_{j∈[N],γ∈[r_0]}. This totals N·M·r_0 + Σ_{l=1}^{L−1} (N/2^l)·r_{l−1}·r_l + Y·r_{L−1} individual parameters, and if we assume equal ranks r := r_0 = ··· = r_{L−1}, the number of parameters becomes approximately N·M·r + N·r^2 + Y·r.

[3] If we had not constrained A^{l,j,γ} to be diagonal, pooling operations would involve entries from different channels.

The hierarchical decomposition (eq. 4) is universal, i.e. with large enough ranks r_l it can represent any tensors. Moreover, it is a super-set of the CP decomposition (eq. 3). That is to say, all tensors representable by a CP decomposition having Z components are also representable by a hierarchical decomposition with ranks r_0 = r_1 = ··· = r_{L−1} = Z [4]. Note that this comes with a polynomial penalty – the number of parameters increases from N·M·Z + Z·Y in the CP decomposition, to N·M·Z + Z·Y + N·Z^2 in the hierarchical decomposition. However, as we show in sec. 4, the gain in expressive power is exponential.

Plugging the expression for A^y in our hierarchical decomposition (eq. 4) into the score function h_y given in eq. 2, we obtain the network displayed in fig. 2 – HT model.
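The recursion of eq. 4 can be spelled out programmatically. The following NumPy sketch (ours, not from the paper; toy sizes, equal ranks, zero-based indexing so locations 2j−1, 2j become 2j, 2j+1) assembles the coefficient tensors for N = 4:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, Y = 4, 3, 2                      # N a power of 2, so L = log2(N) = 2
r = 2                                  # equal ranks r_0 = ... = r_{L-1}
L = int(np.log2(N))

a0 = rng.standard_normal((N, r, M))    # first-level vectors a^{0,j,γ}
w  = [rng.standard_normal((N >> l, r, r)) for l in range(1, L)]  # a^{l,j,γ}
wL = rng.standard_normal((Y, r))       # final-level weights a^{L,y}

# Recursively assemble the tensors φ^{l,j,γ} of eq. 4 (diagonal special case).
phi = [[a0[j, g] for g in range(r)] for j in range(N)]   # level 0: vectors
for l in range(1, L):
    phi = [[sum(w[l - 1][j, g, a] *
                np.multiply.outer(phi[2 * j][a], phi[2 * j + 1][a])
                for a in range(r))
            for g in range(r)]
           for j in range(N >> l)]
Ay = [sum(wL[y, a] * np.multiply.outer(phi[0][a], phi[1][a]) for a in range(r))
      for y in range(Y)]

assert Ay[0].shape == (M,) * N         # an order-N coefficient tensor per class
```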
This network includes a representation layer followed by L = log2(N) hidden layers, which in turn are followed by the output. As in the shallow CP model (fig. 1), the hidden layers consist of 1×1 conv operators followed by product pooling. The difference is that instead of a single hidden layer collapsing the entire spatial structure through global pooling, hidden layers now pool over size-2 windows, decimating feature maps by a factor of two (no overlaps). After L = log2(N) such layers feature maps are reduced to singletons, and we arrive at a 1D structure with r_{L−1} nodes. This is then mapped into Y network outputs through a final dense linear layer.

We note that the network's size-2 pooling windows (and the resulting number of hidden layers L = log2(N)) correspond to the fact that our hierarchical decomposition (eq. 4) is based on a full binary tree over modes, i.e. it combines (through tensor product) two tensors at a time. We focus on this setting solely for simplicity of presentation, and since it is the one presented in Hackbusch and Kühn (2009). Our analysis (sec. 4) could easily be adapted to hierarchical decompositions based on other trees (taking tensor products between more than two tensors at a time), which would correspond to networks with different pooling window sizes and resulting depths.

HT model (fig. 2) is conceptually divided into two parts. The first is the representation layer, transforming the input vectors x_1 . . . x_N into N·M real-valued scalars {f_{θ_d}(x_i)}_{i∈[N],d∈[M]}. The second and main part of the network, which we view as an "inference" engine, is the convolutional arithmetic circuit that takes the N·M measurements produced by the representation layer, and accordingly computes Y class scores at the output layer.

To recap, we now have a deep network (fig. 2), which we refer to as HT model, that computes the score functions h_y (eq.
2) with coefficient tensors A^y hierarchically decomposed as in eq. 4. The network is universal in the sense that with enough channels r_l, any tensors may be represented. Moreover, the model is a super-set of the shallow CP model presented in sec. 3.1. The question of depth efficiency now naturally arises. In particular, we would like to know if there are functions that may be represented by a polynomially sized deep HT model, yet require exponential size from the shallow CP model. The answer, as described in sec. 4, is that almost all functions realizable by HT model meet this property. In other words, the set of functions realizable by a polynomial CP model has measure zero in the space of functions realizable by a given polynomial HT model.

[4] To see this, simply assign the first-level vectors a^{0,j,γ} with CP's basis vectors, the last-level weights with CP's per-class weights, and the intermediate levels' weights with indicator vectors.

3.3. Shared Coefficients for Convolution

The 1×1 conv operator in our networks (see fig. 1 and 2) implements a local linear transformation with coefficients generally being location-dependent. In the special case where the coefficients do not depend on location, i.e. remain fixed across space, the local linear transformation becomes a standard convolution. We refer to this setting as coefficient sharing. Sharing is a widely used structural constraint, one of the pillars behind the successful convolutional network architecture. In the context of image processing (the prominent application of convolutional networks), sharing is motivated by the observation that in natural images, the semantic content of a pattern often does not depend on its location.
In this subsection we explore the effect of sharing on the expressiveness of our networks, or more specifically, on the coefficient tensors A^y they can represent. For CP model, coefficient sharing amounts to setting a^z := a^{z,1} = ··· = a^{z,N} in the CP decomposition (eq. 3), transforming the latter to a symmetric CP decomposition:

    A^y = Σ_{z=1}^Z a^y_z · a^z ⊗ ··· ⊗ a^z  (N times),   a^z ∈ R^M, a^y ∈ R^Z

CP model with sharing is not universal (not all tensors A^y are representable, no matter how large Z is allowed to be) – it can only represent symmetric tensors. In the case of HT model, sharing amounts to applying the following constraints on the hierarchical decomposition in eq. 4: a^{l,γ} := a^{l,1,γ} = ··· = a^{l,N/2^l,γ} for every l = 0 . . . L−1 and γ = 1 . . . r_l. Note that in this case universality is lost as well, but the generated tensors are nonetheless not limited to being symmetric, already demonstrating an expressive advantage of deep models over shallow ones. In sec. 4 we take this further by showing that the shared HT model is exponentially more expressive than CP model, even if the latter is not constrained by sharing.

4. Theorems of Network Capacity

The first contribution of this paper, presented in sec. 3, is the equivalence between deep learning architectures successfully employed in practice, and tensor decompositions. Namely, we showed that convolutional arithmetic circuits as in fig. 2, which are in fact SimNets that have demonstrated promising empirical performance (see app. E), may be formulated as hierarchical tensor decompositions. As a second contribution, we make use of the established link between arithmetic circuits and tensor decompositions, combining theoretical tools from these two worlds, to prove results that are of interest to both the deep learning and tensor analysis communities. This is the focus of the current section.
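Before stating the theorems, the sec. 3.3 claim that shared CP generates only symmetric tensors is easy to verify numerically (an illustrative NumPy sketch of ours, with toy sizes):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
N, M, Z = 3, 4, 2                      # toy sizes, for illustration only

# Shared CP model (sec. 3.3): the same vector a^z is used at every location i.
az = rng.standard_normal((Z, M))
ay = rng.standard_normal(Z)
Ay = sum(ay[z] * np.multiply.outer(np.multiply.outer(az[z], az[z]), az[z])
         for z in range(Z))

# The generated tensor is symmetric: invariant to any permutation of its modes.
for perm in permutations(range(N)):
    assert np.allclose(Ay, np.transpose(Ay, perm))
```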
The fundamental theoretical result proven in this paper is the following:

Theorem 1. Let $\mathcal{A}^y$ be a tensor of order $N$ and dimension $M$ in each mode, generated by the recursive formulas in eq. 4. Define $r := \min\{r_0, M\}$, and consider the space of all possible configurations for the parameters of the composition – $\{a^{l,j,\gamma}\}_{l,j,\gamma}$. In this space, the generated tensor $\mathcal{A}^y$ will have CP-rank of at least $r^{N/2}$ almost everywhere (w.r.t. Lebesgue measure). Put differently, the configurations for which the CP-rank of $\mathcal{A}^y$ is less than $r^{N/2}$ form a set of measure zero. The exact same result holds if we constrain the composition to be "shared", i.e. set $a^{l,j,\gamma} \equiv a^{l,\gamma}$ and consider the space of $\{a^{l,\gamma}\}_{l,\gamma}$ configurations.

From the perspective of deep learning, thm. 1 leads to the following corollary:

Corollary 2. Given linearly independent representation functions $\{f_{\theta_d}\}_{d \in [M]}$, randomizing the weights of the HT model (sec. 3.2) by a continuous distribution induces score functions $h_y$ that, with probability one, cannot be approximated arbitrarily well (in $L^2$ sense) by a CP model (sec. 3.1) with less than $\min\{r_0, M\}^{N/2}$ hidden channels. This result holds even if we constrain the HT model with weight sharing (sec. 3.3) while leaving the CP model in its general form.

That is to say, besides a negligible set, all functions that can be realized by a polynomially sized HT model (with or without weight sharing) require exponential size in order to be realized, or even approximated, by the CP model. Most of the previous works relating to depth efficiency (see app. D) merely show existence of functions that separate depths (i.e. that are efficiently realizable by a deep network yet require super-polynomial size from shallow networks). Corollary 2, on the other hand, establishes depth efficiency for almost all functions that a deep network can implement.
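As an informal sanity check of thm. 1 (not part of the paper's proof), one can draw random weights for a small HT decomposition and certify the rank bound through matricization, since CP-rank is lower-bounded by matricization rank. The sketch below uses $N = 4$ (so $L = 2$) with illustrative ranks $r_0 = r_1 = M = 3$, and verifies that the bound $\min\{r_0, M\}^{N/2} = 9$ is attained:

```python
import numpy as np

rng = np.random.default_rng(0)
M, r0, r1 = 3, 3, 3        # mode dimension and level ranks; N = 4, so L = 2

# "Randomizing the weights by a continuous distribution":
a0 = rng.standard_normal((4, r0, M))    # level-0 vectors a^{0,j,alpha}, j = 1..4
w1 = rng.standard_normal((2, r1, r0))   # level-1 weights a^{1,j,gamma}_alpha
w2 = rng.standard_normal(r1)            # top-level weights a^{2}_alpha (one class)

# Level 1: phi^{1,j,gamma} = sum_alpha a^{1,j,gamma}_alpha a^{0,2j-1,alpha} (x) a^{0,2j,alpha}
phi1 = np.einsum('ga,am,an->gmn', w1[0], a0[0], a0[1])
phi2 = np.einsum('ga,am,an->gmn', w1[1], a0[2], a0[3])

# Level 2: A = sum_alpha a^{2}_alpha phi^{1,1,alpha} (x) phi^{1,2,alpha}
A = np.einsum('a,aij,akl->ijkl', w2, phi1, phi2)

# Matricization: rows indexed by odd modes (1,3), columns by even modes (2,4).
mat = A.transpose(0, 2, 1, 3).reshape(M * M, M * M)

# Theorem 1 predicts CP-rank >= min{r0, M}^{N/2} = 9 almost everywhere;
# the matricization having full rank 9 certifies this for the drawn weights.
assert np.linalg.matrix_rank(mat) == min(r0, M) ** 2
```

With random Gaussian weights, failure of the rank condition would require landing on the measure-zero set of the theorem, so the assertion holds for essentially any seed.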
Equally importantly, it applies to deep learning architectures that are being successfully employed in practice (SimNets – see app. E).

Adopting the viewpoint of tensor analysis, thm. 1 states that besides a negligible set, all tensors realized by HT (Hierarchical Tucker) decomposition cannot be represented by the classic CP (rank-1) decomposition if the latter has less than an exponential number of terms (footnote 5). To the best of our knowledge, this result has never been proved in the tensor analysis community. In the original paper introducing the HT decomposition (Hackbusch and Kühn (2009)), as a motivating example, the authors present a specific tensor that is efficiently realizable by HT decomposition while requiring an exponential number of terms from CP decomposition (footnote 6). Our result strengthens this motivation considerably, showing that it is not just one specific tensor that favors HT over CP, but rather, almost all tensors realizable by HT exhibit this preference. Taking into account that any tensor realized by CP can also be realized by HT with only a polynomial penalty in the number of parameters (see sec. 3.2), this implies that in an asymptotic sense, HT decomposition is exponentially more efficient than CP decomposition.

4.1. Proof Sketches

The complete proofs of thm. 1 and corollary 2 are given in app. B. We provide here an outline of the main tools employed and arguments made along these proofs.

To prove thm. 1 we combine approaches from the worlds of circuit complexity and tensor decompositions. The first class of machinery we employ is matrix algebra, which has proven to be a powerful source of tools for analyzing the complexity of circuits. For example, arithmetic circuits have been analyzed through what is called the partial derivative matrix (see Raz and Yehudayoff (2009)), and for boolean circuits a widely used tool is the communication matrix (see Karchmer (1989)).
We gain access to matrix algebra by arranging tensors that take part in the CP and HT decompositions as matrices, a process often referred to as matricization. With matricization, the tensor product translates to the Kronecker product, and the properties of the latter become readily available.

The second tool-set we make use of is measure theory, which prevails in the study of tensor decompositions, but is much less frequent in analyses of circuit complexity. In order to frame a problem in measure theoretical terms, one obviously needs to define a measure space of interest. For tensor decompositions, the straightforward space to focus on is that of the decomposition variables. For general circuits on the other hand, it is often unclear if defining a measure space is at all appropriate. However, when circuits are considered in the context of machine learning they are usually parameterized, and defining a measure space on top of these parameters is an effective approach for studying the prevalence of various properties in hypotheses spaces.

Our proof of thm. 1 traverses the following path. We begin by showing that matricizing a rank-1 tensor produces a rank-1 matrix. This implies that the matricization of a tensor generated by a CP decomposition with $Z$ terms has rank at most $Z$.

Footnote 5: As stated in sec. 3.2, the decomposition in eq. 4 to which thm. 1 applies is actually a special case of the HT decomposition as introduced in Hackbusch and Kühn (2009). However, the theorem and its proof can easily be adapted to account for the general case. We focus on the special case merely because it corresponds to convolutional arithmetic circuit architectures used in practice.

Footnote 6: The same motivating example is given in a more recent textbook introducing tensor analysis (Hackbusch (2012)).
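These two observations are easy to reproduce numerically. The following sketch (with arbitrary illustrative dimensions) matricizes tensors by mapping odd modes to rows and even modes to columns, then checks that a rank-1 tensor matricizes to a rank-1 matrix, so that a CP decomposition with Z terms yields a matricization of rank at most Z:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, Z = 3, 4, 2          # mode dimension, tensor order (even), CP terms

def matricize(T):
    """Odd modes to rows, even modes to columns (1-indexed modes)."""
    n = T.ndim
    perm = list(range(0, n, 2)) + list(range(1, n, 2))
    rows = int(np.prod([T.shape[i] for i in range(0, n, 2)]))
    return T.transpose(perm).reshape(rows, -1)

def rank1(vectors):
    """Tensor product v1 (x) v2 (x) ... (x) vN of the given vectors."""
    T = vectors[0]
    for v in vectors[1:]:
        T = np.tensordot(T, v, axes=0)
    return T

# Matricizing a rank-1 tensor yields a rank-1 matrix.
T1 = rank1([rng.standard_normal(M) for _ in range(N)])
assert np.linalg.matrix_rank(matricize(T1)) == 1

# Hence a CP decomposition with Z terms has matricization rank at most Z
# (matricization is linear, and rank is subadditive).
A = sum(rng.standard_normal() *
        rank1([rng.standard_normal(M) for _ in range(N)]) for _ in range(Z))
assert np.linalg.matrix_rank(matricize(A)) <= Z
```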
We then turn to show that the matricization of a tensor generated by the HT decomposition in eq. 4 has rank at least $\min\{r_0, M\}^{N/2}$ almost everywhere. This is done through induction over the levels of the decomposition ($l = 1 \ldots L$). For the first level ($l = 1$), we use a combination of measure theoretical and linear algebraic arguments to show that the generated matrices have maximal rank ($\min\{r_0, M\}$) almost everywhere. For the induction step, the facts that under matricization the tensor product translates into the Kronecker product, and that the latter increases ranks multiplicatively (footnote 7), imply that matricization ranks in the current level are generally equal to those in the previous level squared. Measure theoretical claims are then made to ensure that this indeed takes place almost everywhere.

To prove corollary 2 based on thm. 1, we need to show that the inability of the CP model to realize a tensor generated by the HT model implies that the former cannot approximate score functions produced by the latter. In general, the set of tensors expressible by a CP decomposition is not topologically closed (footnote 8), which implies that a-priori, it may be that the CP model can approximate tensors generated by the HT model even though it cannot realize them. However, since the proof of thm. 1 was achieved through separation of matrix rank, distances are indeed positive and the CP model cannot approximate the HT model's tensors almost always. To translate from tensors to score functions, we simply note that in a finite-dimensional Hilbert space convergence in norm implies convergence in coefficients under any basis. Therefore, in the space of score functions (eq. 2) convergence in norm implies convergence in coefficients under the basis $\{(x_1, \ldots, x_N) \mapsto \prod_{i=1}^{N} f_{\theta_{d_i}}(x_i)\}_{d_1 \ldots d_N \in [M]}$. That is to say, it implies convergence in coefficient tensors.

Footnote 7: If $\odot$ denotes the Kronecker product, then for any matrices $A$ and $B$: $\mathrm{rank}(A \odot B) = \mathrm{rank}(A) \cdot \mathrm{rank}(B)$.

Footnote 8: Hence the definition of border rank, see Hackbusch (2012).

4.2. Generalization

Thm. 1 and corollary 2 compare the expressive power of the deep HT model (sec. 3.2) to that of the shallow CP model (sec. 3.1). One may argue that such an analysis is lacking, as it does not convey information regarding the importance of each individual layer. In particular, it does not shed light on the advantage of very deep networks, which at present provide state of the art recognition accuracy, compared to networks of more moderate depth. For this purpose we present a generalization, specifying the amount of resources one has to pay in order to maintain representational power while layers are incrementally cut off from a deep network. For conciseness we defer this analysis to app. A, and merely state here our final conclusions. We find that the representational penalty is double exponential w.r.t. the number of layers removed. In addition, there are certain cases where the removal of even a single layer leads to an exponential inflation, falling in line with the suggestion of Bengio (2009).

5. Discussion

In this work we address a fundamental issue in deep learning – the expressive efficiency of depth. There have been many attempts to theoretically analyze this question, but from a practical machine learning perspective, existing results are limited. Most of the results apply to very specific types of networks that do not resemble ones used in practice, and none of the results account for the locality-sharing-pooling paradigm which forms the basis for convolutional networks – the most successful deep learning architecture to date. In addition, current analyses merely show existence of depth efficiency, i.e. of functions that are efficiently realizable by deep networks but not by shallow ones.
The practical implications of such findings are arguably slight, as a-priori, it may be that only a small fraction of the functions realizable by deep networks enjoy depth efficiency, and for all the rest shallow networks suffice.

Our aim in this paper was to develop a theory that facilitates an analysis of depth efficiency for networks that incorporate the widely used structural ingredients of locality, sharing and pooling. We consider the task of classification into one of a finite set of categories $\mathcal{Y} = \{1 \ldots Y\}$. Our instance space is defined to be the Cartesian product of $N$ vector spaces, in compliance with the common practice of representing natural data through ordered local structures (e.g. images through patches). Each of the $N$ vectors that compose an instance is represented by a descriptor of length $M$, generated by running the vector through $M$ "representation" functions. As customary, classification is achieved through maximization of score functions $h_y$, one for every category $y \in \mathcal{Y}$. Each score function is a linear combination over the $M^N$ possible products that may be formed by taking one descriptor entry from every input vector. The coefficients for these linear combinations conveniently reside in tensors $\mathcal{A}^y$ of order $N$ and dimension $M$ along each axis. We construct networks that compute score functions $h_y$ by decomposing (factorizing) the coefficient tensors $\mathcal{A}^y$. The resulting networks are convolutional arithmetic circuits that incorporate locality, sharing and pooling, and operate on the $N \cdot M$ descriptor entries generated from the input. We show that a shallow (single hidden layer) network realizes the classic CP (rank-1) tensor decomposition, whereas a deep network with $\log_2 N$ hidden layers realizes the recently introduced Hierarchical Tucker (HT) decomposition (Hackbusch and Kühn (2009)).

Our fundamental result, presented in thm. 1 and corollary 2, states that randomizing the weights of a deep network by some continuous distribution will lead, with probability one, to score functions that cannot be approximated by a shallow network if the latter's size is not exponential (in $N$). We extend this result (thm. 3 and corollary 4) by deriving analogous claims that compare two networks of any depths, not just deep vs. shallow. To further highlight the connection between our networks and ones used in practice, we show (app. E) that translating convolution and product pooling computations to log-space (for numerical stability) gives rise to SimNets – a recently proposed deep learning architecture which has been shown to produce state of the art accuracy in computationally limited settings (Cohen et al. (2016)).

Besides the central line of our work discussed above, the construction and theory presented in this paper shed light on various conjectures and practices employed by the deep learning community. First, with respect to the pooling operation, our analysis points to the possibility that perhaps it has more to do with factorization of computed functions than it does with translation invariance. This may serve as an explanation for the fact that pooling windows in state of the art convolutional networks are typically very small (see for example Simonyan and Zisserman (2014)), often much smaller than the radius of translation one would like to be invariant to. Indeed, in our framework, as we show in app. A, pooling over large windows and trimming down a network's depth may bring about an exponential decrease in expressive efficiency.

The second point our theory sheds light on is sharing. As discussed in sec. 3.3, introducing weight sharing to a shallow network (CP model) considerably limits its expressive power.
The network can only represent symmetric tensors, which in turn means that it is location invariant w.r.t. input vectors (patches). In the case of a deep network (HT model) the limitation posed by sharing is not as strict. Generated tensors need not be symmetric, implying that the network is capable of modeling location – a crucial ability in almost any real-world task. The above findings suggest that the sharing constraint is increasingly limiting as a network gets shallower, to the point where it causes complete ignorance of location. This could serve as an argument supporting the empirical success of deep convolutional networks – they bind together the statistical and computational advantages of sharing with many layers that mitigate its expressive limitations.

Lastly, our construction advocates locality, or more specifically, 1×1 receptive fields. Recent convolutional networks providing state of the art recognition performance (e.g. Lin et al. (2014); Szegedy et al. (2015)) make extensive use of 1×1 linear transformations, proving them to be very successful in practice. In view of our model, such 1×1 operators factorize tensors while providing universality with a minimal number of parameters. It seems reasonable to conjecture that for this task of factorizing coefficient tensors, larger receptive fields are not significantly helpful, as they lead to redundancy which may deteriorate performance in the presence of limited training data. Investigation of this conjecture is left for future work.

Acknowledgments

Amnon Shashua would like to thank Tomaso Poggio and Shai S. Shwartz for illuminating discussions during the preparation of this manuscript. We would also like to thank Tomer Galanti, Tamir Hazan and Lior Wolf for commenting on draft versions of the paper. The work is partly funded by Intel grant ICRI-CI no. 9-2012-6133 and by ISF Center grant 1790/12.
Nadav Cohen is supported by a Google Fellowship in Machine Learning.

References

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.

Richard Bellman. Introduction to Matrix Analysis, volume 960. SIAM, 1970.

Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE TPAMI, 2012.

Richard Caron and Tim Traynor. The zero set of a polynomial. WSMR Report 05-02, 2005.

Nadav Cohen and Amnon Shashua. SimNets: A generalization of convolutional networks. Advances in Neural Information Processing Systems (NIPS), Deep Learning Workshop, 2014.

Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. International Conference on Machine Learning (ICML), 2016.

Nadav Cohen, Or Sharir, and Amnon Shashua. Deep SimNets. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.

F. Girosi and T. Poggio. Networks and the best approximation property.
Biological Cybernetics, 63(3):169–176, 1990.

W. Hackbusch and S. Kühn. A new scheme for the tensor representation. Journal of Fourier Analysis and Applications, 15(5):706–722, 2009.

Wolfgang Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42 of Springer Series in Computational Mathematics. Springer Science & Business Media, Berlin, Heidelberg, February 2012.

Benjamin D. Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. CoRR, abs/1202.2745, 2015.

András Hajnal, Wolfgang Maass, Pavel Pudlák, Mario Szegedy, and György Turán. Threshold circuits of bounded depth. In Foundations of Computer Science, 1987, 28th Annual Symposium on, pages 99–110. IEEE, 1987.

Johan Håstad. Almost optimal lower bounds for small depth circuits. In Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pages 6–20. ACM, 1986.

Johan Håstad and Mikael Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1(2):113–129, 1991.

Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Brian Hutchinson, Li Deng, and Dong Yu. Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1944–1957, 2013.

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR, abs/1506.08473, 2015.

Frank Jones. Lebesgue Integration on Euclidean Space. Jones & Bartlett Learning, 2001.

Mauricio Karchmer. Communication Complexity: A New Approach to Circuit Depth. 1989.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. CoRR, abs/1202.2745, 2014.

Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. International Conference on Learning Representations, 2014.

Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. Advances in Neural Information Processing Systems, 2014.

Wolfgang Maass, Georg Schnitger, and Eduardo D. Sontag. A Comparison of the Computational Power of Sigmoid and Boolean Threshold Circuits. Springer, 1994.

James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.

James Martens, Arkadev Chattopadhya, Toni Pitassi, and Richard Zemel. On the representational efficiency of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2877–2885, 2013.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

Alexander Novikov, Anton Rodomanov, Anton Osokin, and Dmitry Vetrov. Putting MRFs on a tensor train. ICML, pages 811–819, 2014.

Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of inference regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv, 1312, 2013.

Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, January 1999.

Hoifung Poon and Pedro Domingos.
Sum-product networks: A new deep architecture. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 689–690. IEEE, 2011.

Ran Raz and Amir Yehudayoff. Lower bounds and separations for constant depth multilinear circuits. Computational Complexity, 18(2):171–207, 2009.

Benjamin Rossman, Rocco A. Servedio, and Li-Yang Tan. An average-case depth hierarchy theorem for boolean circuits. arXiv preprint arXiv:1504.03398, 2015.

Walter Rudin. Functional Analysis. International Series in Pure and Applied Mathematics, 1991.

Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired by visual cortex. CVPR, 2:994–1000, 2005.

Hendra Setiawan, Zhongqiang Huang, Jacob Devlin, Thomas Lamar, Rabih Zbib, Richard M. Schwartz, and John Makhoul. Statistical machine translation features with multitask tensor networks. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, 2015.

Amir Shpilka and Amir Yehudayoff. Arithmetic circuits: A survey of recent results and open questions. Foundations and Trends in Theoretical Computer Science, 5(3–4):207–388, 2010.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Michael Sipser. Borel Sets and Circuit Complexity. ACM, New York, New York, USA, December 1983.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. Advances in Neural Information Processing Systems, pages 926–934, 2013.

Le Song, Mariya Ishteva, Ankur P. Parikh, Eric P. Xing, and Haesun Park. Hierarchical tensor decomposition of latent tree graphical models. ICML, pages 334–342, 2013.
Maxwell Stinchcombe and Halbert White. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. International Joint Conference on Neural Networks, pages 613–617 vol. 1, 1989.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CVPR, 2015.

Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR '14: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, June 2014.

Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint, 2015.

Y. Yang and D. B. Dunson. Bayesian conditional tensor factorizations for high-dimensional classification. Journal of the American Statistical Association, 2015.

Dong Yu, Li Deng, and Frank Seide. Large vocabulary speech recognition using deep tensor neural networks. INTERSPEECH, pages 6–9, 2012.

Daniel Zoran and Yair Weiss. "Natural images, Gaussian mixtures and dead leaves". Advances in Neural Information Processing Systems, pages 1745–1753, 2012.

Appendix A. Generalized Theorem of Network Capacity

In sec. 4 we presented our fundamental theorem of network capacity (thm. 1 and corollary 2), showing that besides a negligible set, all functions that can be realized by a polynomially sized HT model (with or without weight sharing) require exponential size in order to be realized, or even approximated, by the CP model. In terms of network depth, the CP and HT models represent the extremes – the former has only a single hidden layer achieved through global pooling, whereas the latter has $L = \log_2 N$ hidden layers achieved through minimal (size-2) pooling windows.
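A toy sketch of the two pooling extremes (not from the paper): global product pooling and $\log_2 N$ levels of size-2 product pooling compute the same product of activations. The expressive gap analyzed in this appendix stems from the 1×1 conv layers interleaved between pooling levels, which this sketch deliberately omits:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                  # number of spatial locations, a power of 2
x = rng.uniform(0.5, 1.5, size=N)      # one activation per location

# Shallow extreme (CP model): a single global product-pooling window.
global_pool = np.prod(x)

# Deep extreme (HT model): log2(N) levels of size-2 product pooling.
y = x.copy()
levels = 0
while y.size > 1:
    y = y[0::2] * y[1::2]              # pool adjacent pairs
    levels += 1

assert levels == int(np.log2(N))       # here: 3 pooling levels
assert np.isclose(y[0], global_pool)
```

With the intermediate 1×1 convs restored, the deep variant factorizes the coefficient tensor hierarchically at every level, which is where the depth efficiency of the theorems below comes from.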
It is of interest to generalize the fundamental result by establishing a comparison between networks of intermediate depths. This is the focus of the current appendix.

We begin by defining a truncated version of the hierarchical tensor decomposition presented in eq. 4:

$$\begin{aligned}
\phi^{1,j,\gamma} &= \sum_{\alpha=1}^{r_0} a^{1,j,\gamma}_\alpha \, a^{0,2j-1,\alpha} \otimes a^{0,2j,\alpha} \\
&\;\;\vdots \\
\phi^{l,j,\gamma} &= \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \, \underbrace{\phi^{l-1,2j-1,\alpha}}_{\text{order } 2^{l-1}} \otimes \underbrace{\phi^{l-1,2j,\alpha}}_{\text{order } 2^{l-1}} \\
&\;\;\vdots \\
\mathcal{A} &= \sum_{\alpha=1}^{r_{L_c-1}} a^{L_c}_\alpha \bigotimes_{j=1}^{2^{L-L_c+1}} \underbrace{\phi^{L_c-1,j,\alpha}}_{\text{order } 2^{L_c-1}}
\end{aligned} \tag{5}$$

The only difference between this decomposition and the original is that instead of completing the full process with $L := \log_2 N$ levels, we stop after $L_c \leq L$. At this point the remaining tensors are bound together to form the final order-$N$ tensor. The corresponding network will simply include a premature global pooling stage that shrinks feature maps to 1×1, and then a final linear layer that performs classification. As before, we consider a shared version of the decomposition in which $a^{l,j,\gamma} \equiv a^{l,\gamma}$. Notice that this construction realizes a continuum between the CP and HT models, which correspond to the extreme cases $L_c = 1$ and $L_c = L$ respectively.

The following theorem, a generalization of thm. 1, compares a truncated decomposition having $L_1$ levels to one with $L_2 < L_1$ levels that implements the same tensor, quantifying the penalty in terms of parameters:

Theorem 3. Let $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$ be tensors of order $N$ and dimension $M$ in each mode, generated by the truncated recursive formulas in eq. 5, with $L_1$ and $L_2$ levels respectively. Denote by $\{r^{(1)}_l\}_{l=0}^{L_1-1}$ and $\{r^{(2)}_l\}_{l=0}^{L_2-1}$ the composition ranks of $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$ respectively. Assuming w.l.o.g. that $L_1 > L_2$, we define $r := \min\{r^{(1)}_0, \ldots, r^{(1)}_{L_2-1}, M\}$, and consider the space of all possible configurations for the parameters of $\mathcal{A}^{(1)}$'s composition – $\{a^{(1),l,j,\gamma}\}_{l,j,\gamma}$.
In this space, almost everywhere (w.r.t. Lebesgue measure), the generated tensor $\mathcal{A}^{(1)}$ requires that $r^{(2)}_{L_2-1} \geq r^{2^{L-L_2}}$ if one wishes that $\mathcal{A}^{(2)}$ be equal to $\mathcal{A}^{(1)}$. Put differently, the configurations for which $\mathcal{A}^{(1)}$ can be realized by $\mathcal{A}^{(2)}$ with $r^{(2)}_{L_2-1} < r^{2^{L-L_2}}$ form a set of measure zero. The exact same result holds if we constrain the composition of $\mathcal{A}^{(1)}$ to be "shared", i.e. set $a^{(1),l,j,\gamma} \equiv a^{(1),l,\gamma}$ and consider the space of $\{a^{(1),l,\gamma}\}_{l,\gamma}$ configurations.

In analogy with corollary 2, we obtain the following generalization:

Corollary 4. Suppose we are given linearly independent representation functions $f_{\theta_1} \ldots f_{\theta_M}$, and consider two networks that correspond to the truncated hierarchical tensor decomposition in eq. 5, with $L_1$ and $L_2$ hidden layers respectively. Assume w.l.o.g. that $L_1 > L_2$, i.e. that network 1 is deeper than network 2, and define $r$ to be the minimal number of channels across the representation layer and the first $L_2$ hidden layers of network 1. Then, if we randomize the weights of network 1 by a continuous distribution, we obtain, with probability one, score functions $h_y$ that cannot be approximated arbitrarily well (in $L^2$ sense) by network 2 if the latter has less than $r^{2^{L-L_2}}$ channels in its last hidden layer. The result holds even if we constrain network 1 with weight sharing while leaving network 2 in its general form.

Proofs of thm. 3 and corollary 4 are given in app. B. Hereafter, we briefly discuss some of their implications. First, notice that we indeed obtain a generalization of the fundamental theorem of network capacity (thm. 1 and corollary 2), which corresponds to the extreme case $L_1 = L$ and $L_2 = 1$. Second, note that for the baseline case of $L_1 = L$, i.e. when a full-depth network has generated the target score function, approximating this with a truncated network draws a price that grows double exponentially w.r.t. the number of missing layers. Third, and most intriguingly, we see that when $L_1$ is considerably smaller than $L$, i.e. when a significantly truncated network is sufficient to model our problem, cutting off even a single layer leads to an exponential price, and this price is independent of $L_1$. Such scenarios of exponential penalty for trimming down a single layer were discussed in Bengio (2009), but only in the context of specific functions realized by networks that do not resemble ones used in practice (see Håstad and Goldmann (1991) for an example of such a result). We prove this in a much broader, more practical setting, showing that for convolutional arithmetic circuit (SimNet – see app. E) architectures, almost any function realized by a significantly truncated network will exhibit this behavior.

This issue relates to empirical practice, supporting the common methodology of designing networks that go as deep as possible. Specifically, it encourages extending network depth by pooling over small regions, avoiding significant spatial decimation that brings network termination closer.

We conclude this appendix by stressing once more that our construction and theoretical approach are not limited to the models covered by our theorems (CP model, HT model, truncated HT model). These are merely the exemplars deemed most appropriate for initial analysis. The fundamental and generalized theorems of network capacity are similar in spirit, and analogous theorems for networks with different pooling window sizes and depths (corresponding to different tensor decompositions) may easily be derived.

Appendix B. Proofs

B.1. Proof of Theorems 1 and 3

Our proof of thm. 1 and 3 relies on basic knowledge in measure theory, or more specifically, Lebesgue measure spaces.
We do not provide here a comprehensive background on this field (the interested reader is referred to Jones (2001)), but rather supplement the brief discussion given in sec. 2 with a list of facts we will be using which are not necessarily intuitive:

• A union of countably (or finitely) many sets of zero measure is itself a set of zero measure.
• If $p$ is a polynomial over $d$ variables that is not identically zero, the set of points in $\mathbb{R}^d$ on which it vanishes has zero measure (see Caron and Traynor (2005) for a short proof of this).
• If $S \subset \mathbb{R}^{d_1}$ has zero measure, then $S \times \mathbb{R}^{d_2} \subset \mathbb{R}^{d_1+d_2}$, and every set contained within it, have zero measure as well.

In the above, and in the entirety of this paper, the only measure spaces we consider are Euclidean spaces equipped with Lebesgue measure. Thus when we say that a set of $d$-dimensional points has zero measure, we mean that its Lebesgue measure in the $d$-dimensional Euclidean space is zero.

Moving on to some preliminaries from matrix and tensor theory, we denote by $[\mathcal{A}]$ the matricization of an order-$N$ tensor $\mathcal{A}$ (for simplicity, $N$ is assumed to be even), where rows correspond to odd modes and columns correspond to even modes. Namely, if $\mathcal{A} \in \mathbb{R}^{M_1 \times \cdots \times M_N}$, the matrix $[\mathcal{A}]$ has $M_1 \cdot M_3 \cdot \ldots \cdot M_{N-1}$ rows and $M_2 \cdot M_4 \cdot \ldots \cdot M_N$ columns, rearranging the entries of the tensor such that $\mathcal{A}_{d_1 \ldots d_N}$ is stored in row index $1 + \sum_{i=1}^{N/2} (d_{2i-1} - 1) \prod_{j=i+1}^{N/2} M_{2j-1}$ and column index $1 + \sum_{i=1}^{N/2} (d_{2i} - 1) \prod_{j=i+1}^{N/2} M_{2j}$. To distinguish it from the tensor product operation $\otimes$, we denote the Kronecker product between matrices by $\odot$. Specifically, for two matrices $A \in \mathbb{R}^{M_1 \times M_2}$ and $B \in \mathbb{R}^{N_1 \times N_2}$, $A \odot B$ is the matrix in $\mathbb{R}^{M_1 N_1 \times M_2 N_2}$ that holds $A_{ij} B_{kl}$ in row index $(i-1)N_1 + k$ and column index $(j-1)N_2 + l$.
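These definitions can be exercised in a short NumPy sketch (with arbitrary illustrative sizes), verifying both the relation $[\mathcal{A} \otimes \mathcal{B}] = [\mathcal{A}] \odot [\mathcal{B}]$ used in the sequel and the multiplicativity of rank under the Kronecker product. Note that for an order-2 tensor (a matrix), the matricization is the matrix itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def matricize(T):
    """Rows from odd modes, columns from even modes (1-indexed), as in the text."""
    n = T.ndim
    perm = list(range(0, n, 2)) + list(range(1, n, 2))
    rows = int(np.prod([T.shape[i] for i in range(0, n, 2)]))
    return T.transpose(perm).reshape(rows, -1)

# Two order-2 tensors (matrices) of deficient rank.
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 4))   # rank 2
B = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 4))   # rank 3

# Their tensor product A (x) B, an order-4 tensor with entries A[i,j] * B[k,l].
T = np.tensordot(A, B, axes=0)

# Matricization turns the tensor product into the Kronecker product.
assert np.allclose(matricize(T), np.kron(A, B))

# And the Kronecker product multiplies ranks: rank(A kron B) = rank(A) * rank(B).
assert np.linalg.matrix_rank(np.kron(A, B)) == 2 * 3
```

The index bookkeeping in `matricize` is exactly the row/column formula above, which is why `np.kron` reproduces it entry for entry.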
The basic relation that binds together tensor product, matricization and Kronecker product is $[\mathcal{A} \otimes \mathcal{B}] = [\mathcal{A}] \odot [\mathcal{B}]$, where $\mathcal{A}$ and $\mathcal{B}$ are tensors of even orders. Two additional facts we will make use of are that matricization is a linear operator (i.e. for scalars $\alpha_1 \ldots \alpha_r$ and tensors of the same size $\mathcal{A}_1 \ldots \mathcal{A}_r$: $[\sum_{i=1}^r \alpha_i \mathcal{A}_i] = \sum_{i=1}^r \alpha_i [\mathcal{A}_i]$), and, less trivially, that for any matrices $A$ and $B$, the rank of $A \odot B$ is equal to $rank(A) \cdot rank(B)$ (see Bellman et al. (1970) for a proof). These two facts, along with the basic relation laid out above, lead to the conclusion that:

$$rank\left[\mathbf{v}^{(z)}_1 \otimes \cdots \otimes \mathbf{v}^{(z)}_{2^L}\right] = \prod_{i=1}^{2^L/2} rank\overbrace{\left[\mathbf{v}^{(z)}_{2i-1} \otimes \mathbf{v}^{(z)}_{2i}\right]}^{\mathbf{v}^{(z)}_{2i-1}\mathbf{v}^{(z)\top}_{2i}} = 1$$

and thus:

$$rank\left[\sum_{z=1}^{Z} \lambda_z \mathbf{v}^{(z)}_1 \otimes \cdots \otimes \mathbf{v}^{(z)}_{2^L}\right] = rank\sum_{z=1}^{Z} \lambda_z \left[\mathbf{v}^{(z)}_1 \otimes \cdots \otimes \mathbf{v}^{(z)}_{2^L}\right] \le \sum_{z=1}^{Z} rank\left[\mathbf{v}^{(z)}_1 \otimes \cdots \otimes \mathbf{v}^{(z)}_{2^L}\right] = Z$$

In words, an order-$2^L$ tensor given by a CP-decomposition (see sec. 2) with $Z$ terms has a matricization with rank at most $Z$. Thus, to prove that a certain order-$2^L$ tensor has CP-rank of at least $R$, it suffices to show that its matricization has rank of at least $R$.

We now state and prove two lemmas that will be needed for our proofs of thm. 1 and 3.

Lemma 5 Let $M, N \in \mathbb{N}$, and define the following mapping taking $\mathbf{x} \in \mathbb{R}^{2MN+N}$ to three matrices: $A(\mathbf{x}) \in \mathbb{R}^{M \times N}$, $B(\mathbf{x}) \in \mathbb{R}^{M \times N}$ and $D(\mathbf{x}) \in \mathbb{R}^{N \times N}$. $A(\mathbf{x})$ simply holds the first $MN$ elements of $\mathbf{x}$, $B(\mathbf{x})$ holds the following $MN$ elements of $\mathbf{x}$, and $D(\mathbf{x})$ is a diagonal matrix that holds the last $N$ elements of $\mathbf{x}$ on its diagonal. Define the product matrix $U(\mathbf{x}) := A(\mathbf{x}) D(\mathbf{x}) B(\mathbf{x})^\top \in \mathbb{R}^{M \times M}$, and consider the set of points $\mathbf{x}$ for which the rank of $U(\mathbf{x})$ is different from $r := \min\{M, N\}$. This set of points has zero measure.
The result will also hold if the points $\mathbf{x}$ reside in $\mathbb{R}^{MN+N}$, and the same elements are used to assign $A(\mathbf{x})$ and $B(\mathbf{x})$ ($A(\mathbf{x}) \equiv B(\mathbf{x})$).

Proof Obviously $rank(U(\mathbf{x})) \le r$ for all $\mathbf{x}$, so it remains to show that $rank(U(\mathbf{x})) \ge r$ for all $\mathbf{x}$ but a set of zero measure. Let $U_r(\mathbf{x})$ be the top-left $r \times r$ sub-matrix of $U(\mathbf{x})$. If $U_r(\mathbf{x})$ is non-singular then of course $rank(U(\mathbf{x})) \ge r$ as required. It thus suffices to show that the set of points $\mathbf{x}$ for which $\det U_r(\mathbf{x}) = 0$ has zero measure. Now, $\det U_r(\mathbf{x})$ is a polynomial in the entries of $\mathbf{x}$, and so it either vanishes on a set of zero measure, or it is the zero polynomial (see Caron and Traynor (2005)). All that is left is to disqualify the latter option, and that can be done by finding a specific point $\mathbf{x}_0$ for which $\det U_r(\mathbf{x}_0) \neq 0$. Indeed, we may choose $\mathbf{x}_0$ such that $D(\mathbf{x}_0)$ is the identity matrix and $A(\mathbf{x}_0)$, $B(\mathbf{x}_0)$ hold $1$ on their main diagonal and $0$ otherwise. This selection implies that $U_r(\mathbf{x}_0)$ is the identity matrix, and in particular $\det U_r(\mathbf{x}_0) \neq 0$.

Lemma 6 Assume we have $p$ continuous mappings from $\mathbb{R}^d$ to $\mathbb{R}^{M \times N}$ taking the point $\mathbf{y}$ to the matrices $A_1(\mathbf{y}) \ldots A_p(\mathbf{y})$. Assume that under these mappings, the points $\mathbf{y}$ for which every $i \in [p]$ satisfies $rank(A_i(\mathbf{y})) < r$ form a set of zero measure. Define a mapping from $\mathbb{R}^p \times \mathbb{R}^d$ to $\mathbb{R}^{M \times N}$ given by $(\mathbf{x}, \mathbf{y}) \mapsto A(\mathbf{x}, \mathbf{y}) := \sum_{i=1}^p x_i \cdot A_i(\mathbf{y})$. Then, the points $(\mathbf{x}, \mathbf{y})$ for which $rank(A(\mathbf{x}, \mathbf{y})) < r$ form a set of zero measure.

Proof Denote $S := \{(\mathbf{x}, \mathbf{y}) : rank(A(\mathbf{x}, \mathbf{y})) < r\} \subset \mathbb{R}^p \times \mathbb{R}^d$. We would like to show that this set has zero measure. We first note that since $A(\mathbf{x}, \mathbf{y})$ is a continuous mapping, and the set of matrices $A \in \mathbb{R}^{M \times N}$ which have rank less than $r$ is closed, $S$ is a closed set and in particular measurable. Our strategy for computing its measure will be as follows.
For every $\mathbf{y} \in \mathbb{R}^d$ we define the marginal set $S_\mathbf{y} := \{\mathbf{x} : rank(A(\mathbf{x}, \mathbf{y})) < r\} \subset \mathbb{R}^p$. We will show that for every $\mathbf{y}$ but a set of zero measure, the measure of $S_\mathbf{y}$ is zero. An application of Fubini's theorem will then prove the desired result.

Let $C$ be the set of points $\mathbf{y} \in \mathbb{R}^d$ for which $\forall i \in [p]: rank(A_i(\mathbf{y})) < r$. By assumption, $C$ has zero measure. We now show that for $\mathbf{y}_0 \in \mathbb{R}^d \setminus C$, the measure of $S_{\mathbf{y}_0}$ is zero. By the definition of $C$ there exists an $i \in [p]$ such that $rank(A_i(\mathbf{y}_0)) \ge r$. W.l.o.g., we assume that $i = 1$, and that the top-left $r \times r$ sub-matrix of $A_1(\mathbf{y}_0)$ is non-singular. Regarding $\mathbf{y}_0$ as fixed, the determinant of the top-left $r \times r$ sub-matrix of $A(\mathbf{x}, \mathbf{y}_0)$ is a polynomial in the elements of $\mathbf{x}$. It is not the zero polynomial, as setting $x_1 = 1, x_2 = \cdots = x_p = 0$ yields $A(\mathbf{x}, \mathbf{y}_0) = A_1(\mathbf{y}_0)$, and the determinant of the latter's top-left $r \times r$ sub-matrix is non-zero. As a non-zero polynomial, the determinant of the top-left $r \times r$ sub-matrix of $A(\mathbf{x}, \mathbf{y}_0)$ vanishes only on a set of zero measure (Caron and Traynor (2005)). This implies that indeed the measure of $S_{\mathbf{y}_0}$ is zero.

We introduce a few notations towards our application of Fubini's theorem. First, the symbol $\mathbb{1}$ will be used to represent indicator functions, e.g. $\mathbb{1}_S$ is the function from $\mathbb{R}^p \times \mathbb{R}^d$ to $\mathbb{R}$ that receives $1$ on $S$ and $0$ elsewhere. Second, we use a subscript of $n \in \mathbb{N}$ to indicate that the corresponding set is intersected with the hyper-rectangle of radius $n$. For example, $S_n$ stands for the intersection between $S$ and $[-n, n]^{p+d}$, and $\mathbb{R}^d_n$ stands for the intersection between $\mathbb{R}^d$ and $[-n, n]^d$ (which is equal to the latter). All the sets we consider are measurable, and those with subscript $n$ have finite measure.
We may thus apply Fubini's theorem to get:

$$\int_{(\mathbf{x},\mathbf{y})} \mathbb{1}_{S_n} = \int_{(\mathbf{x},\mathbf{y}) \in \mathbb{R}^{p+d}_n} \mathbb{1}_S = \int_{\mathbf{y} \in \mathbb{R}^d_n} \int_{\mathbf{x} \in \mathbb{R}^p_n} \mathbb{1}_{S_\mathbf{y}} = \int_{\mathbf{y} \in \mathbb{R}^d_n \cap C} \int_{\mathbf{x} \in \mathbb{R}^p_n} \mathbb{1}_{S_\mathbf{y}} + \int_{\mathbf{y} \in \mathbb{R}^d_n \setminus C} \int_{\mathbf{x} \in \mathbb{R}^p_n} \mathbb{1}_{S_\mathbf{y}}$$

Recall that the set $C \subset \mathbb{R}^d$ has zero measure, and for every $\mathbf{y} \notin C$ the measure of $S_\mathbf{y} \subset \mathbb{R}^p$ is zero. This implies that both integrals in the last expression vanish, and thus $\int \mathbb{1}_{S_n} = 0$. Finally, we use the monotone convergence theorem to compute $\int \mathbb{1}_S$:

$$\int \mathbb{1}_S = \int \lim_{n \to \infty} \mathbb{1}_{S_n} = \lim_{n \to \infty} \int \mathbb{1}_{S_n} = \lim_{n \to \infty} 0 = 0$$

This shows that indeed our set of interest $S$ has zero measure.

With all preliminaries and lemmas in place, we turn to prove thm. 1, establishing an exponential efficiency of the HT decomposition (eq. 4) over the CP decomposition (eq. 3).

Proof [of theorem 1] We begin with the case of an "unshared" composition, i.e. the one given in eq. 4 (as opposed to the "shared" setting of $\mathbf{a}^{l,j,\gamma} \equiv \mathbf{a}^{l,\gamma}$). Denoting for convenience $\phi^{L,1,1} := \mathcal{A}^y$ and $r_L = 1$, we will show by induction over $l = 1, \ldots, L$ that almost everywhere (at all points but a set of zero measure) w.r.t. $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$, the CP-ranks of the tensors $\{\phi^{l,j,\gamma}\}_{j \in [N/2^l], \gamma \in [r_l]}$ are all at least $r^{2^l/2}$. In accordance with our discussion in the beginning of this subsection, it suffices to consider the matricizations $[\phi^{l,j,\gamma}]$, and show that these all have ranks greater than or equal to $r^{2^l/2}$ almost everywhere. For the case $l = 1$ we have:

$$\phi^{1,j,\gamma} = \sum_{\alpha=1}^{r_0} a^{1,j,\gamma}_\alpha \, \mathbf{a}^{0,2j-1,\alpha} \otimes \mathbf{a}^{0,2j,\alpha}$$

Denote by $A \in \mathbb{R}^{M \times r_0}$ the matrix with columns $\{\mathbf{a}^{0,2j-1,\alpha}\}_{\alpha=1}^{r_0}$, by $B \in \mathbb{R}^{M \times r_0}$ the matrix with columns $\{\mathbf{a}^{0,2j,\alpha}\}_{\alpha=1}^{r_0}$, and by $D \in \mathbb{R}^{r_0 \times r_0}$ the diagonal matrix with $\mathbf{a}^{1,j,\gamma}$ on its diagonal. Then, we may write $[\phi^{1,j,\gamma}] = A D B^\top$, and according to lemma 5 the rank of $[\phi^{1,j,\gamma}]$ equals $r := \min\{r_0, M\}$ almost everywhere w.r.t. $(\{\mathbf{a}^{0,2j-1,\alpha}\}_\alpha, \{\mathbf{a}^{0,2j,\alpha}\}_\alpha, \mathbf{a}^{1,j,\gamma})$.
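The two rank facts driving this induction (the base-case matricization $ADB^\top$ having rank $\min\{r_0, M\}$ generically, and the Kronecker product squaring that rank at the next level) are easy to probe numerically. The following NumPy sketch is our own illustration, not part of the proof; random draws stand in for "generic" parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
M, r0 = 6, 4
r = min(M, r0)
for _ in range(100):
    # base case: [phi^{1,j,gamma}] = A D B^T with random ("generic") parameters
    A = rng.standard_normal((M, r0))
    B = rng.standard_normal((M, r0))
    D = np.diag(rng.standard_normal(r0))
    U1 = A @ D @ B.T
    assert np.linalg.matrix_rank(U1) == r          # rank min{r0, M} a.e.

    # one inductive step: the Kronecker product of two such matricizations
    # has rank r * r, matching the jump from r^{2^{l-1}/2} to r^{2^l/2}
    U2 = A @ np.diag(rng.standard_normal(r0)) @ B.T
    assert np.linalg.matrix_rank(np.kron(U1, U2)) == r * r
```

With probability one over the random draws, neither assertion ever fires, mirroring the "almost everywhere" statements above.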
To see that this holds almost everywhere w.r.t. $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$, one should merely recall that for any dimensions $d_1, d_2 \in \mathbb{N}$, if the set $S \subset \mathbb{R}^{d_1}$ has zero measure, so does any subset of $S \times \mathbb{R}^{d_2} \subset \mathbb{R}^{d_1+d_2}$. A finite union of zero measure sets has zero measure, thus the fact that $rank[\phi^{1,j,\gamma}] = r$ holds almost everywhere individually for any $j \in [N/2]$ and $\gamma \in [r_1]$ implies that it holds almost everywhere jointly for all $j$ and $\gamma$. This proves our inductive hypothesis (unshared case) for $l = 1$.

Assume now that almost everywhere $rank[\phi^{l-1,j',\gamma'}] \ge r^{2^{l-1}/2}$ for all $j' \in [N/2^{l-1}]$ and $\gamma' \in [r_{l-1}]$. For some specific choice of $j \in [N/2^l]$ and $\gamma \in [r_l]$ we have:

$$\phi^{l,j,\gamma} = \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \, \phi^{l-1,2j-1,\alpha} \otimes \phi^{l-1,2j,\alpha} \implies [\phi^{l,j,\gamma}] = \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \, [\phi^{l-1,2j-1,\alpha}] \odot [\phi^{l-1,2j,\alpha}]$$

Denote $M_\alpha := [\phi^{l-1,2j-1,\alpha}] \odot [\phi^{l-1,2j,\alpha}]$ for $\alpha = 1 \ldots r_{l-1}$. By our inductive assumption, and by the general property $rank(A \odot B) = rank(A) \cdot rank(B)$, we have that almost everywhere the ranks of all matrices $M_\alpha$ are at least $r^{2^{l-1}/2} \cdot r^{2^{l-1}/2} = r^{2^l/2}$. Writing $[\phi^{l,j,\gamma}] = \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \cdot M_\alpha$, and noticing that $\{M_\alpha\}$ do not depend on $\mathbf{a}^{l,j,\gamma}$, we turn our attention to lemma 6. The lemma tells us that $rank[\phi^{l,j,\gamma}] \ge r^{2^l/2}$ almost everywhere. Since a finite union of zero measure sets has zero measure, we conclude that almost everywhere $rank[\phi^{l,j,\gamma}] \ge r^{2^l/2}$ holds jointly for all $j \in [N/2^l]$ and $\gamma \in [r_l]$. This completes the proof of the theorem in the unshared case.

Proving the theorem in the shared case may be done in the exact same way, except that for $l = 1$ one needs the version of lemma 5 for which $A(\mathbf{x})$ and $B(\mathbf{x})$ are equal. We now head on to prove thm.
3, which is a generalization of thm. 1. The proof will be similar in nature to that of thm. 1, yet slightly more technical. In short, the idea is to show that in the generic case, expressing $\mathcal{A}^{(1)}$ as a sum of tensor products between tensors of order $2^{L_2-1}$ requires at least $r^{N/2^{L_2}}$ terms. Since $\mathcal{A}^{(2)}$ is expressed as a sum of $r_{L_2-1}$ such terms, demanding $\mathcal{A}^{(2)} = \mathcal{A}^{(1)}$ implies $r_{L_2-1} \ge r^{N/2^{L_2}}$.

To gain technical advantage and utilize known results from matrix theory (as we did when proving thm. 1), we introduce a new tensor "squeezing" operator $\varphi$. For $q \in \mathbb{N}$, $\varphi_q$ is an operator that receives a tensor whose order is divisible by $q$, and returns the tensor obtained by merging together the latter's modes in groups of size $q$. Specifically, when applied to the tensor $\mathcal{A} \in \mathbb{R}^{M_1 \times \cdots \times M_{c \cdot q}}$ ($c \in \mathbb{N}$), $\varphi_q$ returns a tensor of order $c$ which holds $\mathcal{A}_{d_1 \ldots d_{c \cdot q}}$ in the location defined by the following index for every mode $t \in [c]$: $1 + \sum_{i=1}^{q} (d_{i+q(t-1)} - 1) \prod_{j=i+1}^{q} M_{j+q(t-1)}$. Notice that when applied to a tensor of order $q$, $\varphi_q$ returns a vector. Also note that if $\mathcal{A}$ and $\mathcal{B}$ are tensors with orders divisible by $q$, and $\lambda$ is a scalar, we have the desirable properties:

• $\varphi_q(\mathcal{A} \otimes \mathcal{B}) = \varphi_q(\mathcal{A}) \otimes \varphi_q(\mathcal{B})$
• $\varphi_q(\lambda \mathcal{A} + \mathcal{B}) = \lambda \varphi_q(\mathcal{A}) + \varphi_q(\mathcal{B})$

For the sake of our proof we are interested in the case $q = 2^{L_2-1}$, and denote for brevity $\varphi := \varphi_{2^{L_2-1}}$. As stated above, we would like to show that in the generic case, expressing $\mathcal{A}^{(1)}$ as $\sum_{z=1}^{Z} \phi^{(z)}_1 \otimes \cdots \otimes \phi^{(z)}_{N/2^{L_2-1}}$, where $\phi^{(z)}_i$ are tensors of order $2^{L_2-1}$, implies $Z \ge r^{N/2^{L_2}}$. Applying $\varphi$ to both sides of such a decomposition gives $\varphi(\mathcal{A}^{(1)}) = \sum_{z=1}^{Z} \varphi(\phi^{(z)}_1) \otimes \cdots \otimes \varphi(\phi^{(z)}_{N/2^{L_2-1}})$, where $\varphi(\phi^{(z)}_i)$ are now vectors. Thus, to prove thm.
3 it suffices to show that in the generic case, the CP-rank of $\varphi(\mathcal{A}^{(1)})$ is at least $r^{N/2^{L_2}}$, or alternatively, that the rank of the matricization $[\varphi(\mathcal{A}^{(1)})]$ is at least $r^{N/2^{L_2}}$. This will be our strategy in the following proof:

Proof [of theorem 3] In accordance with the above discussion, it suffices to show that in the generic case $rank[\varphi(\mathcal{A}^{(1)})] \ge r^{N/2^{L_2}}$. To ease the path for the reader, we reformulate the problem using slightly simpler notations. We have an order-$N$ tensor $\mathcal{A}$ with dimension $M$ in each mode, generated as follows:

$$\phi^{1,j,\gamma} = \sum_{\alpha=1}^{r_0} a^{1,j,\gamma}_\alpha \, \mathbf{a}^{0,2j-1,\alpha} \otimes \mathbf{a}^{0,2j,\alpha}$$
$$\vdots$$
$$\phi^{l,j,\gamma} = \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \underbrace{\phi^{l-1,2j-1,\alpha}}_{\text{order } 2^{l-1}} \otimes \underbrace{\phi^{l-1,2j,\alpha}}_{\text{order } 2^{l-1}}$$
$$\vdots$$
$$\mathcal{A} = \sum_{\alpha=1}^{r_{L_1-1}} a^{L_1,1,1}_\alpha \bigotimes_{j=1}^{2^{L-L_1+1}} \underbrace{\phi^{L_1-1,j,\alpha}}_{\text{order } 2^{L_1-1}}$$

where:
• $L_1 \le L := \log_2 N$
• $r_0, \ldots, r_{L_1-1} \in \mathbb{N}_{>0}$
• $\mathbf{a}^{0,j,\alpha} \in \mathbb{R}^M$ for $j \in [N]$ and $\alpha \in [r_0]$
• $\mathbf{a}^{l,j,\gamma} \in \mathbb{R}^{r_{l-1}}$ for $l \in [L_1-1]$, $j \in [N/2^l]$ and $\gamma \in [r_l]$
• $\mathbf{a}^{L_1,1,1} \in \mathbb{R}^{r_{L_1-1}}$

Let $L_2$ be a positive integer smaller than $L_1$, and let $\varphi$ be the tensor squeezing operator that merges groups of $2^{L_2-1}$ modes. Define $r := \min\{r_0, \ldots, r_{L_2-1}, M\}$. With $[\cdot]$ being the matricization operator defined in the beginning of the appendix, our task is to prove that $rank[\varphi(\mathcal{A})] \ge r^{N/2^{L_2}}$ almost everywhere w.r.t. $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$. We also consider the case of shared parameters, $\mathbf{a}^{l,j,\gamma} \equiv \mathbf{a}^{l,\gamma}$, where we would like to show that the same condition holds almost everywhere w.r.t. $\{\mathbf{a}^{l,\gamma}\}_{l,\gamma}$.

Our strategy for proving the claim is inductive. We show that for $l = L_2 \ldots L_1 - 1$, almost everywhere it holds that for all $j$ and all $\gamma$: $rank[\varphi(\phi^{l,j,\gamma})] \ge r^{2^{l-L_2}}$. We then treat the special case of $l = L_1$, showing that indeed $rank[\varphi(\mathcal{A})] \ge r^{N/2^{L_2}}$.
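Concretely, the squeezing operator $\varphi_q$ is a grouped C-order reshape, and its two properties can be checked on small random tensors. The sketch below is our own illustration (the function name is ours):

```python
import numpy as np

def squeeze_modes(A, q):
    # phi_q: merge consecutive groups of q modes into single modes; a C-order
    # reshape reproduces the grouped index formula given in the text.
    assert A.ndim % q == 0
    new_shape = tuple(int(np.prod(A.shape[t*q:(t+1)*q]))
                      for t in range(A.ndim // q))
    return A.reshape(new_shape)

rng = np.random.default_rng(2)
q = 2
A = rng.standard_normal((2, 3, 2, 3))     # order 4, divisible by q
B = rng.standard_normal((2, 2))           # order 2
C = rng.standard_normal(A.shape)

# phi_q(A (x) B) = phi_q(A) (x) phi_q(B)
AB = np.tensordot(A, B, axes=0)           # tensor product, order 6
assert np.allclose(
    squeeze_modes(AB, q),
    np.tensordot(squeeze_modes(A, q), squeeze_modes(B, q), axes=0))

# linearity: phi_q(lam*A + C) = lam*phi_q(A) + phi_q(C)
assert np.allclose(squeeze_modes(2.5 * A + C, q),
                   2.5 * squeeze_modes(A, q) + squeeze_modes(C, q))
```

Applied to a tensor of order $q$ this reshape indeed returns a vector, as noted above.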
We begin with the setting of unshared parameters ($\mathbf{a}^{l,j,\gamma}$), and afterwards attend to the scenario of shared parameters ($\mathbf{a}^{l,\gamma}$) as well. Our first task is to treat the case $l = L_2$, i.e. show that $rank[\varphi(\phi^{L_2,j,\gamma})] \ge r$ almost everywhere jointly for all $j$ and all $\gamma$ (there is actually no need for the matricization $[\cdot]$ here, as $\varphi(\phi^{L_2,j,\gamma})$ are already matrices). Since a union of finitely many zero measure sets has zero measure, it suffices to show that this condition holds almost everywhere when specific $j$ and $\gamma$ are chosen.

Denote by $\mathbf{e}_i$ a vector holding $1$ in entry $i$ and $0$ elsewhere, by $\mathbf{0}$ a vector of zeros, and by $\mathbf{1}$ a vector of ones. Suppose that for every $j$ we assign $\mathbf{a}^{0,j,\alpha}$ to be $\mathbf{e}_\alpha$ when $\alpha \le r$ and $\mathbf{0}$ otherwise. Suppose also that for all $1 \le l \le L_2 - 1$ and all $j$ we set $\mathbf{a}^{l,j,\gamma}$ to be $\mathbf{e}_\gamma$ when $\gamma \le r$ and $\mathbf{0}$ otherwise. Finally, assume we set $\mathbf{a}^{L_2,j,\gamma} = \mathbf{1}$ for all $j$ and all $\gamma$. These settings imply that for every $j$, when $\gamma \le r$ we have $\phi^{L_2-1,j,\gamma} = \bigotimes_{j'=1}^{2^{L_2-2}} (\mathbf{e}_\gamma \otimes \mathbf{e}_\gamma)$, i.e. the tensor $\phi^{L_2-1,j,\gamma}$ holds $1$ in location $(\gamma, \ldots, \gamma)$ and $0$ elsewhere. If $\gamma > r$ then $\phi^{L_2-1,j,\gamma}$ is the zero tensor. We conclude from this that there are indices $1 \le i_1 < \ldots < i_r \le M^{2^{L_2-1}}$ such that $\varphi(\phi^{L_2-1,j,\gamma}) = \mathbf{e}_{i_\gamma}$ for $\gamma \le r$, and that for $\gamma > r$ we have $\varphi(\phi^{L_2-1,j,\gamma}) = \mathbf{0}$. We may thus write:

$$\varphi(\phi^{L_2,j,\gamma}) = \varphi\left(\sum_{\alpha=1}^{r_{L_2-1}} \phi^{L_2-1,2j-1,\alpha} \otimes \phi^{L_2-1,2j,\alpha}\right) = \sum_{\alpha=1}^{r_{L_2-1}} \varphi(\phi^{L_2-1,2j-1,\alpha}) \otimes \varphi(\phi^{L_2-1,2j,\alpha}) = \sum_{\alpha=1}^{r} \mathbf{e}_{i_\alpha} \mathbf{e}_{i_\alpha}^\top$$

Now, since $i_1 \ldots i_r$ are different from each other, the matrix $\varphi(\phi^{L_2,j,\gamma})$ has rank $r$. This however does not prove our inductive hypothesis for $l = L_2$. We merely showed a specific parameter assignment for which it holds, and we need to show that it is met almost everywhere. To do so, we consider an $r \times r$ sub-matrix of $\varphi(\phi^{L_2,j,\gamma})$ which is non-singular under the specific parameter assignment we defined.
The determinant of this sub-matrix is a polynomial in the elements of $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$, which we know does not vanish under the specific assignment defined. Thus, this polynomial vanishes on a subset of parameter space having zero measure (see Caron and Traynor (2005)). That is to say, the sub-matrix of $\varphi(\phi^{L_2,j,\gamma})$ is non-singular almost everywhere, and thus $\varphi(\phi^{L_2,j,\gamma})$ has rank at least $r$ almost everywhere. This completes our treatment of the case $l = L_2$.

We now turn to prove the propagation of our inductive hypothesis. Let $l \in \{L_2+1, \ldots, L_1-1\}$, and assume that our inductive hypothesis holds for $l - 1$. Specifically, assume that almost everywhere w.r.t. $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$, we have that $rank[\varphi(\phi^{l-1,j,\gamma})] \ge r^{2^{l-1-L_2}}$ jointly for all $j \in [N/2^{l-1}]$ and all $\gamma \in [r_{l-1}]$. We would like to show that almost everywhere, $rank[\varphi(\phi^{l,j,\gamma})] \ge r^{2^{l-L_2}}$ jointly for all $j \in [N/2^l]$ and all $\gamma \in [r_l]$. Again, the fact that a finite union of zero measure sets has zero measure implies that we may prove the condition for specific $j \in [N/2^l]$ and $\gamma \in [r_l]$. Applying the squeezing operator $\varphi$ followed by matricization $[\cdot]$ to the recursive expression for $\phi^{l,j,\gamma}$, we get:

$$[\varphi(\phi^{l,j,\gamma})] = \left[\varphi\left(\sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \, \phi^{l-1,2j-1,\alpha} \otimes \phi^{l-1,2j,\alpha}\right)\right] = \left[\sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \, \varphi(\phi^{l-1,2j-1,\alpha}) \otimes \varphi(\phi^{l-1,2j,\alpha})\right] = \sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_\alpha \, [\varphi(\phi^{l-1,2j-1,\alpha})] \odot [\varphi(\phi^{l-1,2j,\alpha})]$$

For $\alpha = 1 \ldots r_{l-1}$, denote the matrix $[\varphi(\phi^{l-1,2j-1,\alpha})] \odot [\varphi(\phi^{l-1,2j,\alpha})]$ by $M_\alpha$. The fact that the Kronecker product multiplies ranks, along with our inductive assumption, implies that almost everywhere $rank(M_\alpha) \ge r^{2^{l-1-L_2}} \cdot r^{2^{l-1-L_2}} = r^{2^{l-L_2}}$.
Noting that the matrices $M_\alpha$ do not depend on $\mathbf{a}^{l,j,\gamma}$, we apply lemma 6 and conclude that almost everywhere $rank[\varphi(\phi^{l,j,\gamma})] \ge r^{2^{l-L_2}}$, which completes the proof of the inductive propagation.

Next, we treat the special case $l = L_1$. We assume now that almost everywhere $rank[\varphi(\phi^{L_1-1,j,\gamma})] \ge r^{2^{L_1-1-L_2}}$ jointly for all $j$ and all $\gamma$. Again, we apply the squeezing operator $\varphi$ followed by matricization $[\cdot]$, this time to both sides of the expression for $\mathcal{A}$:

$$[\varphi(\mathcal{A})] = \sum_{\alpha=1}^{r_{L_1-1}} a^{L_1,1,1}_\alpha \bigodot_{j=1}^{2^{L-L_1+1}} [\varphi(\phi^{L_1-1,j,\alpha})]$$

As before, denote $M_\alpha := \bigodot_{j=1}^{2^{L-L_1+1}} [\varphi(\phi^{L_1-1,j,\alpha})]$ for $\alpha = 1 \ldots r_{L_1-1}$. Using again the multiplicative rank property of the Kronecker product along with our inductive assumption, we get that almost everywhere $rank(M_\alpha) \ge \prod_{j=1}^{2^{L-L_1+1}} r^{2^{L_1-1-L_2}} = r^{2^{L-L_2}} = r^{N/2^{L_2}}$. Noticing that $\{M_\alpha\}_{\alpha \in [r_{L_1-1}]}$ do not depend on $\mathbf{a}^{L_1,1,1}$, we apply lemma 6 for the last time and get that almost everywhere (w.r.t. $\{\mathbf{a}^{l,j,\gamma}\}_{l,j,\gamma}$), the rank of $[\varphi(\mathcal{A})]$ is at least $r^{N/2^{L_2}}$. This completes our proof in the case of unshared parameters.

Proving the theorem in the case of shared parameters ($\mathbf{a}^{l,j,\gamma} \equiv \mathbf{a}^{l,\gamma}$) can be done in the exact same way as above. In fact, all one has to do is omit the references to $j$ and the proof will apply. Notice in particular that the specific parameter assignment we defined to handle $l = L_2$ was completely symmetric, i.e. it did not include any dependence on $j$.

B.2. Proof of Corollaries 2 and 4

Corollaries 2 and 4 are a direct continuation of thm. 1 and 3 respectively. In the theorems, we have shown that almost all coefficient tensors generated by a deep network cannot be realized by a shallow network if the latter does not meet a certain minimal size requirement. The corollaries take this further, by stating that given linearly independent representation functions $f_{\theta_1} \ldots f_{\theta_M}$, not only is efficient realization of coefficient tensors generally impossible, but also efficient approximation of score functions. To prove this extra step, we recall from the proofs of thm. 1 and 3 (app. B.1) that in order to show separation between the coefficient tensor of a deep network and that of a shallow network, we relied on matricization rank. Specifically, we derived constants $R_D, R_S \in \mathbb{N}$, $R_D > R_S$, such that the matricization of a deep network's coefficient tensor had rank greater than or equal to $R_D$, whereas the matricization of a shallow network's coefficient tensor had rank smaller than or equal to $R_S$. Given this observation, corollaries 2 and 4 readily follow from lemma 7 below (the lemma relies on basic concepts and results from the topic of $L^2$ Hilbert spaces; see app. C.1 for a brief discussion on the matter).

Lemma 7 Let $f_{\theta_1} \ldots f_{\theta_M} \in L^2(\mathbb{R}^s)$ be a set of linearly independent functions, and denote by $\mathcal{T}$ the (Euclidean) space of tensors with order $N$ and dimension $M$ in each mode. For a given tensor $\mathcal{A} \in \mathcal{T}$, denote by $h(\mathcal{A})$ the function in $L^2\big((\mathbb{R}^s)^N\big)$ defined by:

$$(\mathbf{x}_1, \ldots, \mathbf{x}_N) \overset{h(\mathcal{A})}{\mapsto} \sum_{d_1,\ldots,d_N=1}^{M} \mathcal{A}_{d_1 \ldots d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)$$

Let $\{\mathcal{A}_\lambda\}_{\lambda \in \Lambda} \subset \mathcal{T}$ be a family of tensors, and $\mathcal{A}^*$ be a certain target tensor that lies outside the family. Assume that for all $\lambda \in \Lambda$ we have $rank([\mathcal{A}_\lambda]) < rank([\mathcal{A}^*])$, where $[\cdot]$ is the matricization operator defined in app. B.1. Then, the distance in $L^2\big((\mathbb{R}^s)^N\big)$ between $h(\mathcal{A}^*)$ and $\{h(\mathcal{A}_\lambda)\}_{\lambda \in \Lambda}$ is strictly positive, i.e. there exists an $\epsilon > 0$ such that:

$$\forall \lambda \in \Lambda: \int \big(h(\mathcal{A}_\lambda) - h(\mathcal{A}^*)\big)^2 > \epsilon$$

Proof The fact that $\{f_{\theta_d}(\mathbf{x})\}_{d \in [M]}$ are linearly independent in $L^2(\mathbb{R}^s)$ implies that the product functions $\{\prod_{i=1}^N f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1 \ldots d_N \in [M]}$ are linearly independent in $L^2\big((\mathbb{R}^s)^N\big)$ (see app. C.1).
Let $(h^{(t)})_{t=1}^\infty$ be a sequence of functions that lie in the span of $\{\prod_{i=1}^N f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1 \ldots d_N \in [M]}$, and for every $t \in \mathbb{N}$ denote by $\mathcal{A}^{(t)}$ the coefficient tensor of $h^{(t)}$ under this basis, i.e. $\mathcal{A}^{(t)} \in \mathcal{T}$ is defined by:

$$h^{(t)}(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{d_1,\ldots,d_N=1}^{M} \mathcal{A}^{(t)}_{d_1,\ldots,d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(\mathbf{x}_i)$$

Assume that $(h^{(t)})_{t=1}^\infty$ converges to $h(\mathcal{A}^*)$ in $L^2\big((\mathbb{R}^s)^N\big)$:

$$\lim_{t \to \infty} \int \big(h^{(t)} - h(\mathcal{A}^*)\big)^2 = 0$$

In a finite-dimensional Hilbert space, convergence in norm implies convergence of the representation coefficients under any preselected basis. We thus have:

$$\forall d_1 \ldots d_N \in [M]: \; \mathcal{A}^{(t)}_{d_1,\ldots,d_N} \xrightarrow{t \to \infty} \mathcal{A}^*_{d_1,\ldots,d_N}$$

This means in particular that in the tensor space $\mathcal{T}$, $\mathcal{A}^*$ lies in the closure of $\{\mathcal{A}^{(t)}\}_{t=1}^\infty$. Accordingly, in order to show that the distance in $L^2\big((\mathbb{R}^s)^N\big)$ between $h(\mathcal{A}^*)$ and $\{h(\mathcal{A}_\lambda)\}_{\lambda \in \Lambda}$ is strictly positive, it suffices to show that the distance in $\mathcal{T}$ between $\mathcal{A}^*$ and $\{\mathcal{A}_\lambda\}_{\lambda \in \Lambda}$ is strictly positive, or equivalently, that the distance between the matrix $[\mathcal{A}^*]$ and the family of matrices $\{[\mathcal{A}_\lambda]\}_{\lambda \in \Lambda}$ is strictly positive. This however is a direct implication of the assumption $\forall \lambda \in \Lambda: rank([\mathcal{A}_\lambda]) < rank([\mathcal{A}^*])$.

Appendix C. Derivation of Hypotheses Space

In order to keep the body of the paper at a reasonable length, the presentation of our hypotheses space (eq. 2) in sec. 3 did not provide the grounds for its definition. In this appendix we derive the hypotheses space step by step. After establishing basic preliminaries on the topic of $L^2$ spaces, we utilize the notion of tensor products between such spaces to reach a universal representation as in eq. 2 but with $M \to \infty$. We then make use of empirical studies characterizing the statistics of natural images to argue that in practice a moderate value of $M$ ($M \in \Omega(100)$) suffices.

C.1. Preliminaries on $L^2$ Spaces

When dealing with functions over scalars, vectors or collections of vectors, we consider $L^2$ spaces, or more formally, the Hilbert spaces of Lebesgue measurable square-integrable real functions, equipped with standard (point-wise) addition and scalar multiplication, as well as the inner-product defined by the integral of point-wise multiplication. The topic of $L^2$ function spaces lies at the heart of functional analysis, and requires basic knowledge in measure theory. We present here the bare necessities required to follow this appendix, referring the interested reader to Rudin (1991) for a more comprehensive introduction.

For our purposes, it suffices to view an $L^2$ space as a vector space of all functions $f$ satisfying $\int f^2 < \infty$. This vector space is infinite-dimensional, and a set of functions $F \subset L^2$ is referred to as total if the closure of its span covers the entire space, i.e. if for any function $g \in L^2$ and $\epsilon > 0$, there exist functions $f_1 \ldots f_K \in F$ and coefficients $c_1 \ldots c_K \in \mathbb{R}$ such that $\int |\sum_{i=1}^K c_i \cdot f_i - g|^2 < \epsilon$. $F$ is regarded as linearly independent if all of its finite subsets are linearly independent, i.e. for any distinct $f_1 \ldots f_K \in F$ and $c_1 \ldots c_K \in \mathbb{R}$, if $\sum_{i=1}^K c_i \cdot f_i = 0$ then $c_1 = \cdots = c_K = 0$. A non-trivial result states that $L^2$ spaces in general must contain total and linearly independent sets, and moreover, for any $s \in \mathbb{N}$, $L^2(\mathbb{R}^s)$ contains a countable set of this type. It seems reasonable to draw an analogy between total and linearly independent sets in an $L^2$ space, and bases in a finite-dimensional vector space. While this analogy is indeed appropriate from our perspective, total and linearly independent sets are not to be confused with bases for $L^2$ spaces, which are typically defined to be orthonormal.
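For a finite set $\{f_d\} \subset L^2(\mathbb{R}^s)$, linear independence is equivalent to non-singularity of the Gram matrix of pairwise inner products. The following NumPy sketch (our own illustration; a grid discretization stands in for the $L^2$ integrals) checks this for a few Gaussians, and previews the factorization used next: inner products of the induced product functions on $(\mathbb{R}^s)^N$ factorize, so their Gram matrix is a Kronecker power of $G$:

```python
import numpy as np

# discretized L2(R) inner products on a grid (approximation of the integrals)
x = np.linspace(-5.0, 5.0, 2001)
w = x[1] - x[0]
mus = (-1.0, 0.0, 1.0)
fs = np.stack([np.exp(-(x - mu) ** 2) for mu in mus])   # M = 3 Gaussians
G = (fs * w) @ fs.T                                     # Gram matrix of {f_d}
assert np.linalg.matrix_rank(G) == 3                    # linearly independent

# for product functions f_{d1}(x1) f_{d2}(x2) (N = 2), inner products
# factorize, so the Gram matrix is kron(G, G) and stays non-singular
G2 = np.kron(G, G)
d, e = (0, 2), (1, 1)
direct = (fs[d[0]] * fs[e[0]] * w).sum() * (fs[d[1]] * fs[e[1]] * w).sum()
assert np.isclose(G2[d[0] * 3 + d[1], e[0] * 3 + e[1]], direct)
assert np.linalg.matrix_rank(G2) == 9   # the 9 product functions are independent
```

Since the Kronecker product of non-singular matrices is non-singular, independence of the one-dimensional factors carries over to the product functions, mirroring the Hilbert-space result quoted below.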
It can be shown (see for example Hackbusch (2012)) that for any natural numbers $s$ and $N$, if $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}}$ is a total or a linearly independent set in $L^2(\mathbb{R}^s)$, then $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^N f_{d_i}(\mathbf{x}_i)\}_{d_1 \ldots d_N \in \mathbb{N}}$, the induced point-wise product functions on $(\mathbb{R}^s)^N$, form a set which is total or linearly independent, respectively, in $L^2\big((\mathbb{R}^s)^N\big)$. As we now briefly outline, this result actually emerges from a deep relation between tensor products and Hilbert spaces.

The definitions given in sec. 2 for a tensor, tensor space, and tensor product are actually concrete special cases of much deeper, abstract algebraic concepts. A more formal line of presentation considers multiple vector spaces $V_1 \ldots V_N$, and defines their tensor product space $V_1 \otimes \cdots \otimes V_N$ to be a specific quotient space of the space freely generated by their Cartesian product set. For every combination of vectors $\mathbf{v}^{(i)} \in V_i$, $i \in [N]$, there exists a corresponding element $\mathbf{v}^{(1)} \otimes \cdots \otimes \mathbf{v}^{(N)}$ in the tensor product space, and moreover, elements of this form span the entire space. If $V_1 \ldots V_N$ are Hilbert spaces, it is possible to equip $V_1 \otimes \cdots \otimes V_N$ with a natural inner-product operation, thereby turning it too into a Hilbert space. It may then be shown that if the sets $\{\mathbf{v}^{(i)}_\alpha\}_\alpha \subset V_i$, $i \in [N]$, are total or linearly independent, then elements of the form $\mathbf{v}^{(1)}_{\alpha_1} \otimes \cdots \otimes \mathbf{v}^{(N)}_{\alpha_N}$ are total or linearly independent, respectively, in $V_1 \otimes \cdots \otimes V_N$. Finally, when the underlying Hilbert spaces are $L^2(\mathbb{R}^s)$, the point-wise product mapping $f_1(\mathbf{x}) \otimes \cdots \otimes f_N(\mathbf{x}) \mapsto \prod_{i=1}^N f_i(\mathbf{x}_i)$ from the tensor product space $\big(L^2(\mathbb{R}^s)\big)^{\otimes N} := L^2(\mathbb{R}^s) \otimes \cdots \otimes L^2(\mathbb{R}^s)$ to $L^2\big((\mathbb{R}^s)^N\big)$ induces an isomorphism of Hilbert spaces.

C.2. Construction

Recall from sec.
3 that our instance space is defined as $\mathcal{X} := (\mathbb{R}^s)^N$, in accordance with the common practice of representing natural data through ordered local structures (for example, images are often represented through small patches around their pixels). We classify instances into categories $\mathcal{Y} := \{1 \ldots Y\}$ via maximization of per-label score functions $\{h_y : (\mathbb{R}^s)^N \to \mathbb{R}\}_{y \in \mathcal{Y}}$. Our hypotheses space $\mathcal{H}$ is defined to be the subset of $L^2\big((\mathbb{R}^s)^N\big)$ from which score functions may be taken.

In app. C.1 we stated that if $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}}$ is a total set in $L^2(\mathbb{R}^s)$, i.e. if every function in $L^2(\mathbb{R}^s)$ can be arbitrarily well approximated by a linear combination of a finite subset of $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}}$, then the point-wise products $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^N f_{d_i}(\mathbf{x}_i)\}_{d_1,\ldots,d_N \in \mathbb{N}}$ form a total set in $L^2\big((\mathbb{R}^s)^N\big)$. Accordingly, with a universal hypotheses space $\mathcal{H} = L^2\big((\mathbb{R}^s)^N\big)$, any score function $h_y$ may be arbitrarily well approximated by finite linear combinations of such point-wise products. A possible formulation of this would be as follows. Assume we are interested in $\epsilon$-approximation of the score function $h_y$, and consider a formal tensor $\mathcal{A}^y$ having $N$ modes and a countably infinite dimension in each mode $i \in [N]$, indexed by $d_i \in \mathbb{N}$. Then, there exists such a tensor, with all but a finite number of entries set to zero, for which:

$$h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N) \approx \sum_{d_1 \ldots d_N \in \mathbb{N}} \mathcal{A}^y_{d_1,\ldots,d_N} \prod_{i=1}^N f_{d_i}(\mathbf{x}_i) \qquad (6)$$

Given that the set of functions $\{f_d(\mathbf{x})\}_{d \in \mathbb{N}} \subset L^2(\mathbb{R}^s)$ is total, eq. 6 defines a universal hypotheses space. There are many possibilities for choosing a total set of functions. Wavelets are perhaps the most obvious choice, and were indeed used in a deep network setting by Bruna and Mallat (2012). The special case of Gabor wavelets has been claimed to induce features that resemble representations in the visual cortex (Serre et al. (2005)).
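To make the expansion concrete, a truncated expansion of the form in eq. 6 is just a tensor contraction between the coefficient tensor and the per-patch feature vectors. The NumPy sketch below is our own illustration (Gaussian features and all sizes are arbitrary choices, not prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, s = 4, 3, 2                       # truncation, number of patches, patch dim
mus = rng.standard_normal((M, s))

def features(x):
    # f_d(x): M Gaussian responses, one possible choice of representation funcs
    return np.exp(-np.sum((x - mus) ** 2, axis=1))

A = rng.standard_normal((M,) * N)       # coefficient tensor A^y (finitely many entries)
xs = rng.standard_normal((N, s))        # one instance (x_1, ..., x_N)
F = np.stack([features(x) for x in xs]) # N x M matrix of feature responses

# h_y(x_1,...,x_N) = sum_{d_1..d_N} A_{d_1..d_N} prod_i f_{d_i}(x_i)
h = np.einsum('abc,a,b,c->', A, F[0], F[1], F[2])

# brute-force evaluation of the same M^N-term sum
brute = sum(A[d] * F[0][d[0]] * F[1][d[1]] * F[2][d[2]]
            for d in np.ndindex(*A.shape))
assert np.isclose(h, brute)
```

The contraction view is what the tensor decompositions of the paper exploit: CP and HT networks compute exactly such contractions with structured coefficient tensors.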
Two options we pay special attention to, due to their importance in practice, are:

• Gaussians (with diagonal covariance):
$$f_\theta(\mathbf{x}) = \mathcal{N}\big(\mathbf{x}; \boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)\big) \qquad (7)$$
where $\theta = (\boldsymbol{\mu} \in \mathbb{R}^s, \boldsymbol{\sigma}^2 \in \mathbb{R}^s_{++})$.

• Neurons:
$$f_\theta(\mathbf{x}) = \sigma\big(\mathbf{x}^\top \mathbf{w} + b\big) \qquad (8)$$
where $\theta = (\mathbf{w} \in \mathbb{R}^s, b \in \mathbb{R})$ and $\sigma$ is a point-wise non-linear activation, such as the threshold $\sigma(z) = \mathbb{1}[z > 0]$, the rectified linear unit (ReLU) $\sigma(z) = \max\{z, 0\}$, or the sigmoid $\sigma(z) = 1/(1 + e^{-z})$.

In both cases, there is an underlying parametric family of functions $F = \{f_\theta : \mathbb{R}^s \to \mathbb{R}\}_{\theta \in \Theta}$ of which a countable total subset may be chosen. The fact that Gaussians as above are total in $L^2(\mathbb{R}^s)$ has been proven in Girosi and Poggio (1990), and is a direct corollary of the Stone-Weierstrass theorem. To achieve countability, simply consider Gaussians with rational parameters (means and variances). In practice, the choice of Gaussians (with diagonal covariance) gives rise to a "similarity" operator as described by the SimNet architecture (Cohen and Shashua (2014); Cohen et al. (2016)). For the case of neurons we must restrict the domain $\mathbb{R}^s$ to some bounded set, otherwise the functions are not integrable. This however is not a limitation in practice, and indeed neurons are widely used across many application domains. The fact that neurons are total has been proven in Cybenko (1989) and Hornik et al. (1989) for threshold and sigmoid activations. More generally, it has been proven in Stinchcombe and White (1989) for a wide class of activation functions, including linear combinations of ReLU. See Pinkus (1999) for a survey of such results. For countability, we may again restrict parameters (weights and biases) to be rational.

In the case of Gaussians and neurons, we argue that a finite set of functions suffices, i.e. that it is possible to choose $f_{\theta_1} \ldots f_{\theta_M} \in F$ adequate for representing the score functions required for natural tasks.
Moreover, we claim that $M$ need not be large (e.g. on the order of 100). Our argument relies on statistical properties of natural images, and is fully detailed in app. C.3. It implies that under proper choice of $\{f_{\theta_d}(\mathbf{x})\}_{d \in [M]}$, the finite set of point-wise product functions $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^N f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1,\ldots,d_N \in [M]}$ spans the score functions of interest, and we may define for each label $y$ a tensor $\mathcal{A}^y$ of order $N$ and dimension $M$ in each mode, such that:

$$h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{d_1,\ldots,d_N=1}^{M} \mathcal{A}^y_{d_1,\ldots,d_N} \prod_{i=1}^N f_{\theta_{d_i}}(\mathbf{x}_i) \qquad (2)$$

which is exactly the hypotheses space presented in sec. 3. Notice that if $\{f_{\theta_d}(\mathbf{x})\}_{d \in [M]} \subset L^2(\mathbb{R}^s)$ are linearly independent (there is no reason to choose them otherwise), then so are the product functions $\{(\mathbf{x}_1, \ldots, \mathbf{x}_N) \mapsto \prod_{i=1}^N f_{\theta_{d_i}}(\mathbf{x}_i)\}_{d_1,\ldots,d_N \in [M]} \subset L^2\big((\mathbb{R}^s)^N\big)$ (see app. C.1), and a score function $h_y$ uniquely determines its coefficient tensor $\mathcal{A}^y$. In other words, two score functions $h_{y,1}$ and $h_{y,2}$ are identical if and only if their coefficient tensors $\mathcal{A}^{y,1}$ and $\mathcal{A}^{y,2}$ are the same.

C.3. Finite Function Bases for Classification of Natural Data

In app. C.2 we laid out the framework of classifying instances in the space $\mathcal{X} := \{(\mathbf{x}_1, \ldots, \mathbf{x}_N) : \mathbf{x}_i \in \mathbb{R}^s\} = (\mathbb{R}^s)^N$ into labels $\mathcal{Y} := \{1, \ldots, Y\}$ via maximization of per-label score functions $h_y : \mathcal{X} \to \mathbb{R}$:

$$\hat{y}(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \operatorname*{argmax}_{y \in \mathcal{Y}} h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N)$$

where $h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ is of the form:

$$h_y(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{d_1,\ldots,d_N=1}^{M} \mathcal{A}^y_{d_1,\ldots,d_N} \prod_{i=1}^N f_{\theta_{d_i}}(\mathbf{x}_i) \qquad (2)$$

and $\{f_{\theta_d}\}_{d \in [M]}$ are selected from a parametric family of functions $F = \{f_\theta : \mathbb{R}^s \to \mathbb{R}\}_{\theta \in \Theta}$. For universality, i.e.
for the ability of the score functions h_y to approximate any function in L²(X) as M → ∞, we required that it be possible to choose a countable subset of F that is total in L²(ℝ^s). We noted that the families of Gaussians (eq. 7) and neurons (eq. 8) meet this requirement. In this subsection we formalize our argument that a finite value for M is sufficient when X represents natural data, and in particular, natural images. Based on empirical studies characterizing the statistical properties of natural images, and in compliance with the number of channels in a typical convolutional network layer, we find that M on the order of 100 typically suffices.

Let D be a distribution of labeled instances (X, ȳ) over X × Y (we use bar notation to distinguish the label ȳ from the running index y), and let D_X be the induced marginal distribution of instances X over X. We would like to show, given particular assumptions on D, that there exist functions f_{θ_1}, …, f_{θ_M} ∈ F and tensors A^1, …, A^Y of order N and dimension M in each mode, such that the score functions h_y defined in eq. 2 achieve low classification error:

\[ L^{0\text{-}1}_{\mathcal{D}}(h_1, \ldots, h_Y) := \mathbb{E}_{(X,\bar{y}) \sim \mathcal{D}}\Big[ \mathbb{1}\big[\bar{y} \neq \operatorname*{argmax}_{y \in \mathcal{Y}} h_y(X)\big] \Big] \tag{9} \]

Here 1[·] stands for the indicator function, taking the value 1 when its argument is true and 0 otherwise. Let {h*_y}_{y∈Y} be a set of "ground truth" score functions for which optimal prediction is achieved, or more specifically, for which the expected hinge-loss (which upper bounds the 0-1 loss) is minimal:

\[ (h^*_1, \ldots, h^*_Y) = \operatorname*{argmin}_{h'_1,\ldots,h'_Y : \mathcal{X} \to \mathbb{R}} L^{hinge}_{\mathcal{D}}(h'_1, \ldots, h'_Y) \]

where:

\[ L^{hinge}_{\mathcal{D}}(h'_1, \ldots, h'_Y) := \mathbb{E}_{(X,\bar{y}) \sim \mathcal{D}}\Big[ \max_{y \in \mathcal{Y}}\big\{ \mathbb{1}[y \neq \bar{y}] + h'_y(X) \big\} - h'_{\bar{y}}(X) \Big] \tag{10} \]

Our strategy will be to select score functions h_y of the format given in eq.
2, that approximate h*_y in the sense of a low expected maximal absolute difference:

\[ \mathcal{E} := \mathbb{E}_{X \sim \mathcal{D}_{\mathcal{X}}}\Big[ \max_{y \in \mathcal{Y}} \big| h_y(X) - h^*_y(X) \big| \Big] \tag{11} \]

We refer to E as the score approximation error obtained by h_y. The 0-1 loss of h_y with respect to a labeled example (X, ȳ) ∈ X × Y is bounded as follows:

\[
\begin{aligned}
\mathbb{1}\big[\bar{y} \neq \operatorname*{argmax}_{y \in \mathcal{Y}} h_y(X)\big]
&\leq \max_{y \in \mathcal{Y}}\{\mathbb{1}[y \neq \bar{y}] + h_y(X)\} - h_{\bar{y}}(X) \\
&= \max_{y \in \mathcal{Y}}\{\mathbb{1}[y \neq \bar{y}] + h^*_y(X) + h_y(X) - h^*_y(X)\} - h^*_{\bar{y}}(X) + h^*_{\bar{y}}(X) - h_{\bar{y}}(X) \\
&\leq \max_{y \in \mathcal{Y}}\{\mathbb{1}[y \neq \bar{y}] + h^*_y(X)\} - h^*_{\bar{y}}(X) + \max_{y \in \mathcal{Y}}\{h_y(X) - h^*_y(X)\} + h^*_{\bar{y}}(X) - h_{\bar{y}}(X) \\
&\leq \max_{y \in \mathcal{Y}}\{\mathbb{1}[y \neq \bar{y}] + h^*_y(X)\} - h^*_{\bar{y}}(X) + 2\max_{y \in \mathcal{Y}}\big|h_y(X) - h^*_y(X)\big|
\end{aligned}
\]

Taking the expectation of the first and last terms above with respect to (X, ȳ) ∼ D, and recalling the definitions given in eq. 9, 10 and 11, we get:

\[ L^{0\text{-}1}_{\mathcal{D}}(h_1, \ldots, h_Y) \leq L^{hinge}_{\mathcal{D}}(h^*_1, \ldots, h^*_Y) + 2\mathcal{E} \]

In words, the classification error of the score functions h_y is bounded by the optimal expected hinge-loss plus a term equal to twice their score approximation error. Recall that we did not constrain the optimal score functions h*_y in any way. Thus, assuming a label is deterministic given an instance, the optimal expected hinge-loss is essentially zero, and the classification error of h_y is dominated by their score approximation error E (eq. 11). Our problem thus translates to showing that h_y can be selected such that E is small.

At this point we introduce our main assumption on the distribution D, or more specifically, on the marginal distribution of instances D_X. According to various studies, in natural settings the marginal distribution of individual vectors in X, e.g.
of small patches in images, may be relatively well captured by a Gaussian Mixture Model (GMM) with a moderate number (on the order of 100 or less) of distinct components. For example, it was shown in Zoran and Weiss (2012) that natural image patches of size 2×2, 4×4, 8×8 or 16×16 can essentially be modeled by GMMs with 64 components (adding more components barely improved the log-likelihood). This complies with the common belief that a moderate number of low-level templates suffices to model the vast majority of local image patches.

Following this line, we model the marginal distribution of x_i with a GMM having M components with means µ_1, …, µ_M ∈ ℝ^s. We assume that the components are well localized, i.e. that their standard deviations are small compared to the distances between means, and also compared to the variation of the target functions h*_y. In the context of images, for example, the latter two assumptions imply that a local patch can be unambiguously assigned to a template, and that the assignment of patches to templates determines the class of an image. Returning to general instances X, their probability mass will be concentrated in distinct regions of the space X, in which for every i ∈ [N], the vector x_i lies near µ_{c_i} for some c_i ∈ [M]. The score functions h*_y are approximately constant in each such region. It is important to stress here that we do not assume statistical independence of the x_i's, only that their possible values can be quantized into M templates µ_1, …, µ_M.

Under our idealized assumptions on D_X, the expectation in the score approximation error E can be discretized as follows:

\[ \mathcal{E} := \mathbb{E}_{X \sim \mathcal{D}_{\mathcal{X}}}\Big[ \max_{y \in \mathcal{Y}} \big| h_y(X) - h^*_y(X) \big| \Big] = \sum_{c_1,\ldots,c_N=1}^{M} P_{c_1,\ldots,c_N} \max_{y \in \mathcal{Y}} \big| h_y(M_{c_1,\ldots,c_N}) - h^*_y(M_{c_1,\ldots,c_N}) \big| \tag{12} \]

where M_{c_1,…,c_N} := (µ_{c_1}, …
, µ_{c_N}), and P_{c_1,…,c_N} stands for the probability that x_i lies near µ_{c_i} for every i ∈ [N] (P_{c_1,…,c_N} ≥ 0, Σ_{c_1,…,c_N} P_{c_1,…,c_N} = 1).

We now turn to show that f_{θ_1}, …, f_{θ_M} can be chosen to separate the GMM components, i.e. such that for every c, d ∈ [M], f_{θ_d}(µ_c) ≠ 0 if and only if c = d. If the functions f_θ are Gaussians (eq. 7), we can simply set the mean of f_{θ_d} to µ_d, and its standard deviations low enough that the function effectively vanishes at µ_c when c ≠ d. If the f_θ are neurons (eq. 8), an additional requirement is needed, namely that the GMM component means µ_1, …, µ_M be linearly separable. In other words, we require that for every d ∈ [M], there exist w_d ∈ ℝ^s and b_d ∈ ℝ for which w_d^⊤µ_c + b_d is positive if c = d and negative otherwise. This may seem like a strict assumption at first glance, but notice that the dimension s is often as large as, or even larger than, the number of components M. In addition, if the input vectors x_i are normalized to unit length (a standard practice with image patches, for example), µ_1, …, µ_M will also be normalized, and thus linear separability is trivially met. Assuming we have linear separability, one may set θ_d = (w_d, b_d), and for threshold or ReLU activations we indeed get f_{θ_d}(µ_c) ≠ 0 ⟺ c = d. With sigmoid activations, we may need to scale (w_d, b_d) so that w_d^⊤µ_c + b_d ≪ 0 when c ≠ d, ensuring that in this case f_{θ_d}(µ_c) effectively vanishes.

Assuming we have chosen f_{θ_1}, …, f_{θ_M} to separate the GMM components, and plugging in the format of h_y given in eq.
2, we get the following convenient form for h_y(M_{c_1,…,c_N}):

\[ h_y(M_{c_1,\ldots,c_N}) = \mathcal{A}^y_{c_1,\ldots,c_N} \prod_{i=1}^{N} f_{\theta_{c_i}}(\mu_{c_i}) \]

Assigning the coefficient tensors through the following rule:

\[ \mathcal{A}^y_{c_1,\ldots,c_N} = \frac{h^*_y(M_{c_1,\ldots,c_N})}{\prod_{i=1}^{N} f_{\theta_{c_i}}(\mu_{c_i})} \]

implies h_y(M_{c_1,…,c_N}) = h*_y(M_{c_1,…,c_N}) for every y ∈ Y and c_1, …, c_N ∈ [M]. Plugging this into eq. 12, we get a score approximation error of zero.

To recap, we have shown that when the parametric functions f_θ are Gaussians (eq. 7) or neurons (eq. 8), not only are the score functions h_y given in eq. 2 universal when M → ∞ (see app. C.2), but they can also achieve zero classification error (eq. 9) with a moderate value of M (on the order of 100) if the underlying data distribution D is "natural". In this context, D is regarded as natural if it satisfies two conditions. The first, which is rather mild, requires that a label be completely determined by the instance. For example, an image will belong to one category with probability one, and to the rest of the categories with probability zero. The second condition, which is far more restrictive, states that the input vectors composing an instance can be quantized into a moderate number (M) of templates. The assumption that natural images exhibit this property is based on various empirical studies, where it is shown to hold approximately. Since it does not hold exactly, our analysis is approximate, and its implication in practice is that the classification error introduced by constraining score functions to have the format given in eq. 2 is negligible compared to other sources of error (factorization of the coefficient tensors, finiteness of training data, and difficulty in optimization).

Appendix D.
Related Work

The classic approach for theoretically analyzing the power of depth focused on investigating the computational complexity of boolean circuits. An early result, known as the "exponential efficiency of depth", may be summarized as follows: for every integer k, there are boolean functions that can be computed by a circuit comprising alternating layers of AND and OR gates which has depth k and polynomial size, yet if one limits the depth to k−1 or less, an exponentially large circuit is required. See Sipser (1983) for a formal statement of this classic result. Recently, Rossman et al. (2015) established a somewhat stronger result, showing cases where polynomially wide shallow boolean circuits are incapable not only of exact realization, but also of approximation (i.e. of agreeing with the target function on more than a specified fraction of input combinations).

Other classical results relate to threshold circuits, a class of models more similar to contemporary neural networks than boolean circuits. Namely, they can be viewed as neural networks where each neuron computes a weighted sum of its inputs (possibly including bias), followed by a threshold activation (σ(z) = 1[z ≥ 0]). For threshold circuits, the main known result in our context is the existence of functions that separate depth 3 from depth 2 (see Hajnal et al. (1987) for a statement relating to exact realization, and the techniques in Maass et al. (1994); Martens et al. (2013) for an extension to approximation).

More recent studies focus on arithmetic circuits (Shpilka and Yehudayoff (2010)), whose nodes typically compute either a weighted sum or a product of their inputs⁹ (besides their role in studying expressiveness, deep networks of this class have been shown to support provably optimal training, Livni et al. (2014)). A special case of these are the Sum-Product Networks (SPNs) presented in Poon and Domingos (2011).
SPNs are a class of deep generative models designed to efficiently compute probability density functions. Their summation weights are typically constrained to be non-negative (such an arithmetic circuit is called monotone), and in addition, in order for them to be valid (i.e. to be able to compute probability density functions), additional architectural constraints are needed (e.g. decomposability and completeness). The most widely known theoretical arguments regarding the efficiency of depth in SPNs were given in Delalleau and Bengio (2011). In that work, two specific families of SPNs were considered, both comprising alternating sum and product layers: a family F whose nodes form a full binary tree, and a family G with n nodes per layer (excluding the output), each connected to n−1 nodes in the preceding layer. The authors show that functions implemented by these networks require an exponential number of nodes in order to be realized by shallow (single hidden-layer) networks. The limitations of this work are twofold. First, as the authors note themselves, it only analyzes the ability of shallow networks to exactly realize functions generated by deep networks, and does not provide any result relating to approximation. Second, the specific SPN families considered in this work are not universal hypothesis classes, and do not resemble networks used in practice.

9. There are different definitions for arithmetic circuits in the literature. We adopt the definition given in Martens and Medabalimi (2014), under which an arithmetic circuit is a directed acyclic graph, where nodes with no incoming edges correspond to inputs, nodes with no outgoing edges correspond to outputs, and the remaining nodes are labeled either as "sum" or as "product". A product node computes the product of its child nodes. A sum node computes a weighted sum of its child nodes, where the weights are parameters linked to its incoming edges.
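The definition in footnote 9 can be made concrete with a minimal recursive evaluator. This is our own illustrative sketch (the node encoding is an assumption of the example, not notation from the paper), and for brevity it evaluates a tree rather than a general DAG:

```python
def evaluate(node, x):
    """Evaluate an arithmetic-circuit node (in the sense of footnote 9) on input x.
    Nodes are encoded as ('input', i), ('prod', children) or
    ('sum', children, weights)."""
    kind = node[0]
    if kind == "input":
        return x[node[1]]
    if kind == "prod":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, x)  # product node: product of children
        return result
    _, children, weights = node           # sum node: weighted sum of children
    return sum(w * evaluate(c, x) for c, w in zip(children, weights))

# A small circuit with one sum layer and one product layer,
# computing (2*x0 + 3*x1) * x2:
circuit = ("prod", [("sum", [("input", 0), ("input", 1)], [2.0, 3.0]),
                    ("input", 2)])
```

Evaluating `circuit` on x = (1, 1, 2) gives (2 + 3) · 2 = 10; deeper circuits are built by nesting alternating sum and product nodes, as in the SPN families discussed above.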
Recently, Martens and Medabalimi (2014) proved that there exist functions which can be efficiently computed by decomposable and complete (D&C) SPNs of depth d+1, yet require a D&C SPN of depth d or less to have super-polynomial size for exact realization. This analysis treats approximation only in the limited case of separating depth 4 from depth 3 (D&C) SPNs. Additionally, it deals only with specific separating functions, and does not convey information regarding how frequent these are. In other words, according to this analysis, it may be that almost all functions generated by deep networks can be efficiently realized by shallow networks, and there are only a few pathological functions for which this does not hold. A further limitation of this analysis is that for general d, the separation between depths d+1 and d is based on a multilinear circuit result by Raz and Yehudayoff (2009), which translates into a network that, once again, does not follow the common practices of deep learning.

There have been recent attempts to analyze the efficiency of network depth in other settings as well. The most commonly used type of neural network these days includes neurons that compute a weighted sum of their inputs (with bias) followed by a Rectified Linear Unit (ReLU) activation (σ(z) = max{0, z}). Pascanu et al. (2013) and Montufar et al. (2014) study the number of linear regions that may be expressed by such networks as a function of their depth and width, thereby showing the existence of functions separating deep from shallow (depth 2) networks. Telgarsky (2015) shows a simple construction of a depth-d, width-2 ReLU network that operates on one-dimensional inputs, realizing a function that cannot be approximated by ReLU networks of depth o(d/log d) and width polynomial in d.
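The intuition behind Telgarsky-style constructions is that composing a width-2 ReLU "tent" layer with itself d times yields a sawtooth with exponentially many oscillations, which a shallow network cannot match with few pieces. The following is our own rendering of that well-known idea, not code from the cited paper:

```python
def tent(x):
    """One width-2 ReLU layer on [0, 1]: 2*relu(x) - 4*relu(x - 0.5).
    Rises linearly to 1 at x = 0.5, then falls back to 0 at x = 1."""
    return 2.0 * max(x, 0.0) - 4.0 * max(x - 0.5, 0.0)

def sawtooth(x, d):
    """Compose the tent map d times: a depth-d, width-2 ReLU network whose
    output oscillates 2**(d-1) times on [0, 1]."""
    for _ in range(d):
        x = tent(x)
    return x
```

For instance, the peaks of `sawtooth(x, d)` sit at the dyadic points x = (2k+1)/2^d, so the number of linear pieces doubles with every added layer while the width stays constant at 2.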
Eldan and Shamir (2015) provide functions expressible by ReLU networks of depth 3 and polynomial width, which can only be approximated by a depth-2 network if the latter's width is exponential. The result in that paper applies not only to the ReLU activation, but also to the standard sigmoid (σ(z) = 1/(1+e^{−z})), and more generally, to any universal activation (see assumption 1 in Eldan and Shamir (2015)). Bianchini and Scarselli (2014) also consider different types of activations, studying the topological complexity (through Betti numbers) of decision regions as a function of network depth, width and activation type. The results in that paper establish the existence of deep vs. shallow separating functions only for the case of polynomial activation. While the above works do address more conventional neural networks, they do not account for the structure of convolutional networks, the most successful deep learning architectures to date, and more importantly, they too prove only the existence of some separating functions, without providing any insight as to how frequent these are.

We are not the first to incorporate ideas from the field of tensor analysis into deep learning. Socher et al. (2013), Yu et al. (2012), Setiawan et al. (2015), and Hutchinson et al. (2013) all proposed different neural network architectures that include tensor-based elements, and exhibit various advantages in terms of expressiveness and/or ease of training. In Janzamin et al. (2015), an alternative algorithm for training neural networks is proposed, based on tensor decomposition and Fourier analysis, with proven generalization bounds. In Novikov et al. (2014), Anandkumar et al. (2014), Yang and Dunson (2015) and Song et al. (2013), algorithms for tensor decompositions are used to estimate the parameters of different graphical models. Notably, Song et al.
(2013) use the relatively new Hierarchical Tucker decomposition (Hackbusch and Kühn (2009)) that we employ in our work, with certain similarities in the formulations. The works differ considerably in their objectives, though: while Song et al. (2013) focus on the proposal of a new training algorithm, our purpose in this work is to analyze the expressive efficiency of networks and how that depends on depth. Recently, Lebedev et al. (2014) modeled the filters in a convolutional network as four-dimensional tensors, and used the CP decomposition to construct an efficient and accurate approximation. Another work that draws a connection between tensor analysis and deep learning is the recent study presented in Haeffele and Vidal (2015). This work shows that with sufficiently large neural networks, no matter how training is initialized, there exists a local optimum that is accessible with gradient descent, and this local optimum is approximately equivalent to the global optimum in terms of objective value.

Appendix E. Computation in Log-Space with SimNets

A practical issue one faces when implementing arithmetic circuits is the numerical instability of the product operation: a product node with a large number of inputs is easily susceptible to numerical overflow or underflow. A common solution is to perform the computations in log-space, i.e. instead of computing activations we compute their log. This requires the activations to be non-negative to begin with, and alters the sum and product operations as follows. A product simply turns into a sum, as log ∏_i α_i = Σ_i log α_i. A sum becomes what is known as log-sum-exp or softmax: log Σ_i α_i = log Σ_i exp(log α_i).

Turning to our networks, the requirement that all activations be non-negative does not limit their universality.
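The two log-space identities above can be checked directly. The sketch below (our own illustration) uses the max-shift form of log-sum-exp, since the naive version overflows as soon as the log-activations are large:

```python
import numpy as np

def logsumexp(log_a):
    """Stable log(sum_i a_i) from log-activations log_a: shift by c = max
    before exponentiating, then add c back."""
    c = np.max(log_a)
    return c + np.log(np.sum(np.exp(log_a - c)))

a = np.array([0.5, 1.5, 2.0])
log_a = np.log(a)
# Product node in log-space:  log(prod_i a_i) = sum_i log(a_i)
# Sum node in log-space:      log(sum_i a_i)  = logsumexp(log_a)
```

With log-activations like 800, `np.exp` alone would overflow to infinity, while the shifted form only exponentiates non-positive numbers and stays finite.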
The reason is that the functions f_θ are non-negative in both cases of interest: Gaussians (eq. 7) and neurons (eq. 8). In addition, one can always add a common offset to all coefficient tensors A^y, ensuring they are positive without affecting classification. Non-negative decompositions (i.e. decompositions with all weights holding non-negative values) can then be found, leading all network activations to be non-negative. In general, non-negative tensor decompositions may be less efficient than unconstrained decompositions, as there are cases where a non-negative tensor admits an unconstrained decomposition that is smaller than its minimal non-negative decomposition. Nevertheless, as we shall soon see, these non-negative decompositions translate into a proven architecture, which was demonstrated to achieve performance comparable to state-of-the-art convolutional networks, so in practice the deterioration in efficiency does not seem to be significant.

Naively implementing the CP or HT model (fig. 1 or 2, respectively) in log-space translates to a log activation following the locally connected linear transformations (convolutions if coefficients are shared, see sec. 3.3), to product pooling turning into sum pooling, and to an exp activation following the pooling. However, applying exp and log activations as just described, without proper handling of the inputs to each computational layer, would not result in a numerically stable computation¹⁰.

The SimNet architecture (Cohen and Shashua (2014); Cohen et al. (2016)) naturally brings forth a numerically stable implementation of our networks.
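The claim above, that a common offset added to all coefficient tensors does not affect classification, follows because every label uses the same basis-function products, so the offset shifts every score h_y by the same amount. A quick numerical check (our own illustration; the dimensions and random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, Y = 2, 3, 4
F = rng.uniform(0.1, 1.0, size=(N, M))   # F[i, d] = f_{theta_d}(x_i) >= 0
A = rng.normal(size=(Y, M, M))           # one order-N coefficient tensor per label

def score(Ay):
    # h_y of eq. 2 for N = 2: sum_{d1,d2} Ay[d1,d2] * F[0,d1] * F[1,d2]
    return float(np.einsum("ab,a,b->", Ay, F[0], F[1]))

c = -A.min() + 1.0                        # common offset making all tensors positive
scores = np.array([score(A[y]) for y in range(Y)])
shifted = np.array([score(A[y] + c) for y in range(Y)])
```

All shifted tensors are strictly positive, every score moves by the same constant c · (Σ_d F[0,d]) · (Σ_d F[1,d]), and the argmax, i.e. the predicted label, is unchanged.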
The architecture is based on two ingredients: a flexible similarity measure and the MEX operator:

\[ \mathrm{MEX}_\beta(\mathbf{x}, \mathbf{b}) := \frac{1}{\beta} \log\Bigg( \frac{1}{N} \sum_j \exp\big(\beta (x_j + b_j)\big) \Bigg) \]

The similarity layer, capable of computing both the common convolutional operator and weighted l_p norms, may realize the representation by computing log f_θ(x_i), whereas MEX can naturally implement both log-sum-exp and sum pooling (lim_{β→0} MEX_β(x, 0) = mean_j{x_j}) in a numerically stable manner. Not only are SimNets capable of correctly and efficiently implementing our networks, but they have already been demonstrated (Cohen et al. (2016)) to perform as well as state-of-the-art convolutional networks on several image recognition benchmarks, and to outperform them when computational resources are limited.

10. A naive implementation of softmax is not numerically stable, as it involves storing α_i = exp(log α_i) directly. This, however, is easily corrected by defining c := max_i log α_i and computing log Σ_i exp(log α_i − c) + c. The result is identical, but now we only exponentiate non-positive numbers (no overflow), with at least one of these numbers equal to zero (no underflow).
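Following footnote 10, the MEX operator itself can be implemented stably with the same max-shift trick. A minimal sketch of the operator as defined above (our own code, not the SimNet implementation):

```python
import numpy as np

def mex(x, b, beta):
    """MEX_beta(x, b) = (1/beta) * log( (1/N) * sum_j exp(beta*(x_j + b_j)) ),
    computed with the max-shift of footnote 10 so large inputs do not overflow."""
    z = beta * (np.asarray(x) + np.asarray(b))
    c = np.max(z)
    return (c + np.log(np.mean(np.exp(z - c)))) / beta

x = np.array([1000.0, 1000.5, 999.0])   # naive exp(beta * x_j) would overflow
v = mex(x, np.zeros(3), beta=1.0)
```

The result v is finite and lies between the mean and the maximum of x; small β recovers mean pooling of the inputs, while large β approaches max pooling, both without leaving log-space.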
