Deeply-Sparse Signal rePresentations ($\text{DS}^2\text{P}$)


Authors: Demba Ba

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

Demba Ba, Member, IEEE

Abstract: A recent line of work shows that a deep neural network with ReLU nonlinearities arises from a finite sequence of cascaded sparse coding models, the outputs of which, except for the last element in the cascade, are sparse and unobservable. That is, intermediate outputs deep in the cascade are sparse, hence the title of this manuscript. We show here, using techniques from the dictionary-learning literature, that if the measurement matrices in the cascaded sparse coding model (a) satisfy RIP and (b) all have sparse columns except for the last, they can be recovered with high probability. We propose two algorithms for this purpose: one that recovers the matrices in a forward sequence, and another that recovers them in a backward sequence. The method of choice in deep learning to solve this problem is to train an auto-encoder. Our algorithms provide a sound alternative, with theoretical guarantees, as well as upper bounds on sample complexity. The theory shows that the learning complexity of the forward algorithm depends on the number of hidden units at the deepest layer and the number of active neurons at that layer (sparsity). In addition, the theory relates the number of hidden units in successive layers, thus giving a practical prescription for designing deep ReLU neural networks. Because it puts fewer restrictions on the architecture, the backward algorithm requires more data. We demonstrate the deep dictionary learning algorithm via simulations. Finally, we use a coupon-collection argument to conjecture a lower bound on sample complexity that gives some insight as to why deep networks require more data to train than shallow ones.

Index Terms: Dictionary Learning, Deep Neural Networks, Sample Complexity
I. INTRODUCTION

Deep learning has been one of the most popular areas of research over the past few years, due in large part to the ability of deep neural networks to outperform humans at a number of cognition tasks, such as object and speech recognition. Despite the mystique that has surrounded their success, recent work has started to provide answers to questions pertaining, on the one hand, to basic assumptions behind deep networks (when do they work?) and, on the other hand, to interpretability (why do they work?). In [1], Patel explains deep learning from the perspective of inference in a hierarchical probabilistic graphical model. This leads to new inference algorithms based on belief propagation and its variants. In a series of articles, the authors of [2], [3], [4] consider deep convolutional networks through the lens of a multi-layer convolutional sparse coding (ML-CSC) model. They show a correspondence between the sparse approximation step in this multi-layer model and the encoding step (forward pass) in a related deep convolutional network. Specifically, they show that convolutional neural networks with ReLU nonlinearities can be interpreted as sequential algorithms that solve for the sparse codes in the ML-CSC model. The authors carry out a detailed theoretical analysis of when the sparse recovery algorithms succeed in the absence of noise, and when they are stable in the presence of noise. More recently, building on the work of Papyan, the work in [5] shows that some of the key operations that arise in deep learning (e.g., pooling, ReLU) can be understood from the classical theory of filter banks in signal processing.

D. Ba is with the School of Engineering and Applied Sciences, Harvard University, Cambridge, MA (e-mail: demba@seas.harvard.edu). Manuscript received April 19, 2005; revised August 26, 2015.
In a separate line of work, Tishby [6] uses the information bottleneck principle from information theory to characterize the limits of a deep network from an information-theoretic perspective. The works [1] and [2], [3], [4] relate the inference step (forward pass) of neural networks to various generative models, namely graphical models in the former and the ML-CSC model in the latter. They do not provide theoretical guarantees for learning the filters (weights) of the respective generative models. Here, we take a more expansive approach than in [1], [2], [5] that connects deep networks to the theory of dictionary learning, to answer questions pertaining, not to basic assumptions and interpretability, but to the sample complexity of learning a deep network: how much data do you need to learn a deep network? Classical dictionary-learning theory [7] tackles the problem of estimating a single unknown transformation from data obtained through a sparse coding model. The theory gives bounds for the sample complexity of learning a dictionary as a function of the parameters of the sparse-coding model. Motivated by classical dictionary learning, recent work [8] shows that a two-layer auto-encoder with ReLU nonlinearity trained by backpropagation, using an infinite amount of data, learns the dictionary from a sparse-coding model. Neither classical dictionary-learning theory nor the work from [8] provides a framework for assessing the complexity of learning a hierarchy, or sequence, of transformations from data. We formulate a deep version of the classical sparse-coding generative model from dictionary learning [7]: starting with a sparse code, we apply a composition of linear transformations to generate an observation. We constrain all the transformations in the composition, except for the last, to have sparse columns, so that their composition yields sparse representations at every step.
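As a concrete illustration of this generative model, the following numpy sketch composes one sparse-column transformation with a dense one and checks that the intermediate representation stays sparse. All dimensions, the seed, and the two-layer depth are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_columns(rows, cols, s, rng):
    """Matrix whose every column has exactly s nonzero entries."""
    A = np.zeros((rows, cols))
    for j in range(cols):
        support = rng.choice(rows, size=s, replace=False)
        A[support, j] = rng.standard_normal(s) / np.sqrt(s)
    return A

# Illustrative sizes: sparse code of length r1, observation of length d.
r1, r2, d, s = 64, 32, 16, 3

x = np.zeros(r1)                      # s-sparse code
x[rng.choice(r1, size=s, replace=False)] = rng.standard_normal(s)

A1 = sparse_columns(r2, r1, s, rng)   # sparse columns: image of A1 is sparse
A2 = rng.standard_normal((d, r2))     # last transformation may be dense

y1 = A1 @ x                           # intermediate representation
y = A2 @ y1                           # observation

# Each of the s active code entries touches one s-sparse column of A1,
# so y1 has at most s * s nonzero entries.
print(np.count_nonzero(y1), "nonzeros out of", r2)
```

The observation `y` itself is generically dense; only the intermediate outputs of the cascade are sparse.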
We solve the deep dictionary learning problem (learning all of the transformations in the composition) by sequential alternating minimization. We introduce two sequential algorithms. The first is a forward-factorization algorithm that learns the transformations in a forward sequence, starting with the first and ending with the last. The second is a backward-factorization algorithm that learns the transformations in a backward sequence. We begin the rest of our treatment by briefly introducing notation (Section II). Our main contributions are the following.

The connection between dictionary learning and auto-encoders. First, in Section III we develop the connection between classical dictionary learning and deep recurrent sparse auto-encoders [9], [10], [11], [12]. This motivates a deep generalization of the sparse coding model, developed in Section IV, and its connection to deep ReLU auto-encoders.

Upper bounds on the sample complexity of deep sparse dictionary learning. Second, we prove in Section IV that, under regularity assumptions, both the forward- and backward-factorization algorithms can learn the sequence of dictionaries in the deep model. For each algorithm, we characterize the computational complexity of this learning as a function of the model parameters. Let $\{r_\ell\}_{\ell=1}^{L}$ be the number of hidden units at layer $\ell$ (layer $L$ is the shallowest hidden layer and layer 1 the deepest). Further let $\{s_{Y^{(\ell-1)}}\}_{\ell=1}^{L}$ be the number of active neurons at layer $\ell$, and $\{s^{(\ell)}\}_{\ell=1}^{L-1}$ denote the sparsity level of the sparse weight matrix connecting hidden layer $\ell+1$ to hidden layer $\ell$. The computational complexity of the forward factorization is $O(\max(r_1^2, r_1 s_{Y^{(0)}}))$, i.e., a function of the number of hidden units in the deepest layer and the number of active neurons in that layer.
In addition, the forward-factorization algorithm requires $r_\ell = O(\max(r_{\ell+1}^2, r_{\ell+1} s^{(\ell)}))$, $\ell = 1, \ldots, L-1$, i.e., it relates the number of hidden units at a given layer to that at the preceding layer, giving a practical prescription for designing deep ReLU architectures. As detailed in the main text, the backward-factorization algorithm requires more data because it puts less stringent constraints on the architecture.

Conjecture on a lower bound on complexity. Third, we use an argument based on the coupon-collector problem [13] to conjecture a lower bound $O(r_1 s_{Y^{(0)}} \log r_1)$ on complexity. This bound gives some insight as to why deep networks require more data to train than shallow ones.

A characterization of the properties of column-sparse random matrices. Fourth, our proofs rely on a certain class of sparse matrices satisfying the restricted isometry property (RIP) [14]. We prove this in Section V for random matrices using results from non-asymptotic random matrix theory [15].

A notion of RIP for product matrices. Finally, our proof of convergence of the forward-factorization algorithm relies on a structured product of matrices satisfying RIP, i.e., a notion of near-isometry for the product of matrices. We introduce such a notion and prove it for the matrices involved in the deep sparse-coding model. The RIP was originally developed for single matrices, so our result may be of independent interest.

We demonstrate the algorithms for learning the sequence of transformations in the deep sparse-coding model via simulation in Section VI. Similar to [7], the simulations suggest that the second-order terms in the complexity results above are an artifact of the proof techniques we borrow from [7]. We give concluding remarks in Section VII.

II. NOTATION

We use bold font for matrices and vectors, capital letters for matrices, and lower-case letters for vectors.
For a matrix $\mathbf{A}$, $\mathbf{a}_j$ denotes its $j$th column vector, and $A_{ij}$ its element at the $i$th row and $j$th column. For a vector $\mathbf{x}$, $x_i$ denotes its $i$th element. $\mathbf{A}^T$ and $\mathbf{x}^T$ refer, respectively, to the transpose of the matrix $\mathbf{A}$ and that of the vector $\mathbf{x}$. We use $\|\mathbf{x}\|_p$ to denote the $\ell_p$ norm of the vector $\mathbf{x}$. We use $\sigma_{\min}(\mathbf{A})$ and $\sigma_{\max}(\mathbf{A})$ to refer, respectively, to the minimum and maximum singular values of the matrix $\mathbf{A}$. We will also use $\|\mathbf{A}\|_2$ to denote the spectral norm (maximum singular value) of a matrix. We will make it clear from context whether a quantity is a random variable/vector. We use $\mathbf{I}$ to refer to the identity matrix; its dimension will be clear from context. Let $r \in \mathbb{N}^+$. For a vector $\mathbf{x} \in \mathbb{R}^r$, $\text{Supp}(\mathbf{x}) \subset \{1, \ldots, r\}$ refers to the set of indices corresponding to its nonzero entries.

III. SHALLOW NEURAL NETWORKS AND SPARSE ESTIMATION

The rectified-linear unit (ReLU) is a popular nonlinearity in the neural-networks literature. Let $z \in \mathbb{R}$; the ReLU nonlinearity is the scalar-valued function defined as $\text{ReLU}(z) = \max(z, 0)$. In this section, we build a parallel between sparse approximation/$\ell_1$-regularized regression and the ReLU$(\cdot)$ nonlinearity. This leads to a parallel between dictionary learning [7] and auto-encoder networks [10], [16]. In turn, this motivates an interpretation of learning a deep neural network as a hierarchical, i.e., sequential, dictionary-learning problem in a generative model that is the cascade of sparse-coding models [2].

A. Unconstrained and non-negative $\ell_1$-regularization in one dimension

We begin with a derivation of the soft-thresholding operator from the perspective of sparse approximation. Let $y \in \mathbb{R}$, $\lambda > 0$, and consider the inverse problem
$$\min_{x \in \mathbb{R}} \frac{1}{2}(y - x)^2 + \lambda |x|. \quad (1)$$
It is well known that the solution $\hat{x}$ to Equation 1 is given by the soft-thresholding operator, i.e.,
$$\hat{x} = \text{sgn}(y)\max(|y| - \lambda, 0) := s_\lambda(y) = \text{ReLU}(y - \lambda) - \text{ReLU}(-y - \lambda). \quad (2)$$
For completeness, we give the derivation of this result in the appendix. If $x$ is non-negative, Equation 1 becomes
$$\min_{x \in \mathbb{R}^+} \frac{1}{2}(y - x)^2 + \lambda x, \quad (3)$$
and its solution is $\hat{x} = \max(y - \lambda, 0) = \text{ReLU}(y - \lambda)$. In other words, the ReLU$(\cdot)$ nonlinearity arises in the solution to a simple constrained $\ell_1$-regularized regression problem in one dimension. To see this, note that for $y > 0$ the solution coincides with that of Equation 1. For $y \leq 0$, the solution must be $\hat{x} = 0$: if $x > 0$, the value of the objective function is $\frac{1}{2}(y - x)^2 + \lambda x$, which is strictly greater than $\frac{1}{2}y^2$, i.e., the objective function evaluated at $x = 0$.

B. Unconstrained $\ell_1$-regularization in more than one dimension

The above results generalize easily to the case when the observations and the optimization variable both live in higher dimensions and are related through a unitary transform. Let $r \in \mathbb{N}^+$, $\mathbf{y} \in \mathbb{R}^r$, $\lambda > 0$, and let $\mathbf{A}$ be a unitary matrix. Consider the problem
$$\min_{\mathbf{x} \in \mathbb{R}^r_+} \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda \|\mathbf{x}\|_1. \quad (4)$$
Since $\mathbf{A}$ is unitary, i.e., an isometry, Equation 4 is equivalent to
$$\min_{\mathbf{x} \in \mathbb{R}^r_+} \frac{1}{2}\|\tilde{\mathbf{y}} - \mathbf{x}\|_2^2 + \lambda \|\mathbf{x}\|_1 = \min_{\mathbf{x} \in \mathbb{R}^r_+} \sum_{j=1}^r \frac{1}{2}(\tilde{y}_j - x_j)^2 + \lambda x_j, \quad (5)$$
where $\tilde{y}_j = \mathbf{a}_j^T \mathbf{y}$, $j = 1, \ldots, r$. Equation 5 is separable in $x_1, \ldots, x_r$. For each $x_j$, the optimization is equivalent to Equation 3 with $\tilde{y}_j$ as the input, $j = 1, \ldots, r$. Therefore,
$$\hat{x}_j = \text{ReLU}\left(\begin{pmatrix} \mathbf{a}_j \\ -\lambda \end{pmatrix}^T \begin{pmatrix} \mathbf{y} \\ 1 \end{pmatrix}\right). \quad (6)$$

Fig. 1. Auto-encoder architecture motivated by the interpretation of the ReLU in the context of sparse approximation. The encoder solves the $\ell_1$-regularized least-squares problem with non-negativity constraints.
The decoder reconstructs the observations by applying the dictionary to the output of the encoder. Assuming the biases are known, the only parameter of this architecture is the dictionary $\mathbf{A}$, which can be learned by backpropagation. To simplify the figure, we only draw a subset of all of the connections.

Equation 6 states that, for $\mathbf{A}$ unitary, a simple one-layer feed-forward neural network solves the inverse problem of Equation 4, and that $-\lambda$ plays the role of the bias in neural networks. Applying the transformation $\mathbf{A}$ to the vector $\hat{\mathbf{x}}$ yields an approximate reconstruction $\hat{\mathbf{y}} = \mathbf{A}\hat{\mathbf{x}}$. We depict this two-stage process as a two-layer feed-forward neural network in Figure 1. The architecture depicted in the figure is called an auto-encoder [10], [16], [11], [12]. Given training examples, the weights of the network, which depend on $\mathbf{A}$, can be learned by backpropagation. This suggests a connection between dictionary learning and auto-encoder architectures, which we elaborate upon below.

Remark 1: The literature suggests that the parallel between the ReLU and sparse approximation dates to the work [9]. In [17], the authors discuss in detail the sparsity-promoting properties of the ReLU compared to other nonlinearities in neural networks.

C. Sparse coding, dictionary learning, and auto-encoders

We use the relationship between ReLU and $\ell_1$-regularized regression to draw a parallel between dictionary learning [7] and a specific auto-encoder neural-network architecture [10], [16], [11], [12].

a) Shallow sparse generative model: Let $\mathbf{Y}$ be a $d \times n$ real-valued matrix generated as follows:
$$\mathbf{Y} = \mathbf{A}\mathbf{X}, \quad \mathbf{A} \in \mathbb{R}^{d \times r}, \quad \mathbf{X} \in \mathbb{R}^{r \times n}. \quad (7)$$
Each column of $\mathbf{X}$ is an $s$-sparse vector, i.e., only $s$ of its elements are non-zero [7]. The non-zero elements represent the coordinates, or codes, for the corresponding column of $\mathbf{Y}$ in the dictionary $\mathbf{A}$ [7].

Remark 2: We call this model "shallow" because there is only one transformation $\mathbf{A}$ to learn.
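The closed form of the encoder in Equation 6 can be checked numerically. The sketch below (dimensions and seed are arbitrary) builds a random orthogonal dictionary, applies the one-layer ReLU encoder, and verifies that no feasible perturbation improves the objective of Equation 4:

```python
import numpy as np

rng = np.random.default_rng(1)
r, lam = 8, 0.3

# Random orthogonal (real unitary) dictionary via QR.
A, _ = np.linalg.qr(rng.standard_normal((r, r)))
y = rng.standard_normal(r)

# Encoder of Equation 6: one layer, weights A^T, bias -lambda, ReLU.
x_hat = np.maximum(A.T @ y - lam, 0.0)

def objective(x):
    """Objective of Equation 4 (x is constrained to be non-negative)."""
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(x)

# x_hat should beat every feasible (non-negative) perturbation.
worst_gap = min(objective(np.maximum(x_hat + 0.1 * rng.standard_normal(r), 0.0))
                - objective(x_hat)
                for _ in range(200))
print("smallest objective gap over perturbations:", worst_gap)
```

Because the problem is convex and separable once rotated by the unitary $\mathbf{A}$, the per-coordinate closed form is the exact global minimizer, so every gap is non-negative up to floating-point error.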
b) Sparse coding and dictionary learning: Given $\mathbf{Y}$, the goal is to estimate $\mathbf{A}$ and $\mathbf{X}$. Alternating minimization [7] (Algorithm 1) is a popular algorithm to find $\mathbf{A}$ and $\mathbf{X}$ as the solution to
$$(\hat{\mathbf{A}}, \hat{\mathbf{X}}) = \arg\min_{\mathbf{X}, \mathbf{A}} \|\mathbf{x}_i\|_1, \ \forall i = 1, \ldots, n \quad \text{s.t.} \quad \mathbf{Y} = \mathbf{A}\mathbf{X}, \ \|\mathbf{a}_j\|_2 = 1, \ \forall j = 1, \ldots, r. \quad (8)$$
Algorithm 1 solves Equation 8 by alternating between a sparse-coding step, which updates the sparse codes given an estimate of the dictionary, and a dictionary-update step, which updates the dictionary given estimates of the sparse codes. In unconstrained form, the sparse-coding step of the $t$th iteration is equivalent to solving Equation 4, with a value for $\lambda$ that depends on $\epsilon_t$. We defer a discussion of the assumptions behind Algorithm 1 to Section IV, where we introduce two deep generalizations of the algorithm.

Algorithm 1: AltMinDict($\mathbf{Y}, \mathbf{A}(0), \epsilon_t, s, T$): Alternating minimization algorithm for dictionary learning
Input: Samples $\mathbf{Y}$, initial dictionary estimate $\mathbf{A}(0)$, accuracy sequence $\epsilon_t$, sparsity level $s$, and number of iterations $T$.
1 for $t = 0$ to $T - 1$ do
2   for $i = 1$ to $n$ do
3     $\mathbf{X}(t+1)_i = \arg\min_{\mathbf{x}} \|\mathbf{x}\|_1$ s.t. $\|\mathbf{y}_i - \mathbf{A}(t)\mathbf{x}\|_2 \leq \epsilon_t$
4   Threshold: $\mathbf{X}(t+1) = \mathbf{X}(t+1) \cdot \mathbb{1}[\mathbf{X}(t+1) > 9s\epsilon_t]$
5   Estimate $\mathbf{A}(t+1) = \mathbf{Y}\mathbf{X}(t+1)^+$
6   Normalize: $\mathbf{a}(t+1)_i = \mathbf{a}(t+1)_i / \|\mathbf{a}(t+1)_i\|_2$
Output: ($\mathbf{A}(T), \mathbf{X}(T)$)

Suppose that, instead of requiring equality in Equation 7, our goal were instead to solve the following problem:
$$\min_{\mathbf{A}, \mathbf{X}} \frac{1}{2}\|\mathbf{Y} - \mathbf{A}\mathbf{X}\|_F^2 + \lambda \sum_{i=1}^n \|\mathbf{x}_i\|_1. \quad (9)$$
If $\mathbf{A}$ were a unitary matrix, the sparse-coding step could be solved exactly using Equation 6. The goal of the dictionary-learning step is to minimize the reconstruction error between $\mathbf{A}$ applied to the sparse codes and the observations. In the neural-network literature, this two-stage process describes so-called auto-encoder architectures [10], [16].
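The alternation between the two steps can be sketched as follows. This is a simplification of Algorithm 1, not the algorithm itself: the exact constrained sparse-coding step of line 3 is replaced by a few ISTA iterations on the unconstrained problem of Equation 9, the thresholding step is omitted, and we warm-start near the true dictionary in the spirit of the initialization assumptions discussed in Section IV. All sizes and the seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n, s, lam = 20, 10, 200, 2, 0.05

# Ground truth from the shallow model of Equation 7.
A_true = rng.standard_normal((d, r))
A_true /= np.linalg.norm(A_true, axis=0)          # unit-norm columns
X_true = np.zeros((r, n))
for i in range(n):
    X_true[rng.choice(r, size=s, replace=False), i] = rng.standard_normal(s)
Y = A_true @ X_true

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

A = A_true + 0.05 * rng.standard_normal((d, r))   # warm start near truth
A /= np.linalg.norm(A, axis=0)
for _ in range(10):                               # alternating minimization
    # Sparse-coding step: ISTA on Equation 9 with the current dictionary.
    M = np.linalg.norm(A, 2) ** 2
    X = np.zeros((r, n))
    for _ in range(100):
        X = soft(X + A.T @ (Y - A @ X) / M, lam / M)
    # Dictionary step: least squares, then column normalization.
    A = Y @ np.linalg.pinv(X)
    A /= np.linalg.norm(A, axis=0)

rel_err = np.linalg.norm(Y - A @ X) / np.linalg.norm(Y)
print("relative reconstruction error:", rel_err)
```

The least-squares-plus-normalization update mirrors lines 5 and 6 of Algorithm 1; the ISTA inner loop stands in for the exact $\ell_1$ program of line 3.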
Remark 3: We make the assumption that $\mathbf{A}$ is unitary to simplify the discussion and make the parallel between neural networks and dictionary learning more apparent. If $\mathbf{A}$ is not unitary, we can use the iterative soft-thresholding algorithm (ISTA) [18]:
$$\mathbf{x}^k = s_{\frac{\lambda}{M}}\left(\mathbf{x}^{k-1} + \frac{1}{M}\mathbf{A}^T(\mathbf{y} - \mathbf{A}\mathbf{x}^{k-1})\right),$$
where $k$ indexes the iterations of the algorithm, and $M \geq \sigma_{\max}(\mathbf{A}^T\mathbf{A})$.

Fig. 2. Auto-encoder architecture motivated by the alternating-minimization algorithm for sparse coding and dictionary learning. The encoder uses $K$ ($K$ large) iterations of the ISTA algorithm for sparse coding, starting with a guess $\mathbf{x}^0$ of the sparse code. The decoder reconstructs the observations by applying the dictionary to the output of the encoder. Assuming the biases are known, the only parameter of this architecture is the dictionary $\mathbf{A}$, which can be learned by backpropagation. $M$ is a constant such that $M \geq \sigma_{\max}(\mathbf{A}^T\mathbf{A})$.

c) Shallow, constrained, recurrent, sparse auto-encoders: We introduce an auto-encoder architecture for learning the model from Equation 7. This auto-encoder has an implicit connection with the alternating-minimization algorithm applied to the same model. Given $\mathbf{A}$, the encoder produces a sparse code using a finite (large) number of iterations of the ISTA algorithm [18]. The decoder applies $\mathbf{A}$ to the output of the encoder to reconstruct $\mathbf{y}$. We call this architecture a constrained recurrent sparse auto-encoder (CRsAE) [11], [19]. The constraint comes from the fact that the operations used by the encoder and the decoder are tied to each other through $\mathbf{A}$; hence, the encoder and decoder are not independent, unlike in [10], [16]. The encoder is recurrent because of the ISTA algorithm, which is an iterative procedure. Figure 2 depicts this architecture. Recent work [8] shows that a version of this auto-encoder, with one iteration of ISTA, trained by backpropagation, learns the dictionary from a sparse-coding model under certain regularity assumptions.
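In code, ISTA and the ReLU form of the soft-thresholding operator (Equation 2) look as follows; a sketch, with the iteration count $K$ and all sizes chosen arbitrarily:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def soft_threshold(z, lam):
    # Equation 2: s_lam(z) = ReLU(z - lam) - ReLU(-z - lam).
    return relu(z - lam) - relu(-z - lam)

def ista(y, A, lam, K=100):
    """K iterations of ISTA, i.e., the recurrent encoder of Figure 2."""
    M = np.linalg.norm(A, 2) ** 2          # M >= sigma_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(K):
        x = soft_threshold(x + A.T @ (y - A @ x) / M, lam / M)
    return x

rng = np.random.default_rng(3)

# The ReLU form agrees with sgn(z) * max(|z| - lam, 0).
z = np.linspace(-3.0, 3.0, 1001)
same = np.allclose(soft_threshold(z, 0.7),
                   np.sign(z) * np.maximum(np.abs(z) - 0.7, 0.0))
print("identity holds:", same)

# For an orthogonal A (the unitary case of Section III-B), ISTA converges
# to the closed form soft_threshold(A^T y, lam) essentially in one step.
A, _ = np.linalg.qr(rng.standard_normal((8, 8)))
y = rng.standard_normal(8)
x_ista = ista(y, A, lam=0.3)
```

Unrolling the `for` loop over $K$ gives exactly the $K$-layer recurrent encoder described in the text.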
Their work, however, assumes an infinite amount of data and does not consider, as we do, the problem of learning a hierarchy of dictionaries and its finite-sample complexity. There are two basic take-aways from the previous discussion:
1) Constrained auto-encoders with ReLU nonlinearities capture the essence of the alternating-minimization algorithm for dictionary learning.
2) Therefore, the sample complexity of dictionary learning can give us insights on the hardness of learning neural networks.

d) How to use dictionary learning to assess the sample complexity of learning deep networks? The depth of a neural network refers to the number of its hidden layers, excluding the output layer. A shallow network is one with two or three hidden layers [20]. A network with more than three hidden layers is typically called "deep". Using this definition, the architecture from Figure 2 would be called deep: its $K$ iterations of ISTA, when unrolled [9], [10], [11], [12], would constitute $K$ separate layers. This definition, however, does not reflect the fact that the only unknown in the network is $\mathbf{A}$. Therefore, the number of parameters of the network is the same as that in a one-layer, fully-connected, feed-forward network. A popular interpretation of deep neural networks is that they learn a hierarchy, or sequence, of transformations of data. Motivated by this interpretation, we define the depth of a network, not in relationship to its number of layers, but as the number of underlying distinct transformations/mappings to be learned. Classic dictionary learning tackles the problem of estimating a single transformation from data [7]. Dictionary-learning theory characterizes the sample complexity of learning the model of Equation 7 under various assumptions. We can use these results to get insights on the complexity of learning the parameters of the auto-encoder from Figure 2.
Classical dictionary-learning theory does not, however, provide a framework for assessing the complexity of learning a hierarchy, or sequence, of transformations from data.

IV. DEEP SPARSE SIGNAL REPRESENTATIONS

Our goal is to build a deep (in the sense defined previously) version of the model from Equation 7, i.e., a generative model in which, starting with a sparse code, a composition of linear transformations is applied to generate an observation. What properties should such a model obey? In the previous section, we used the sparse generative model of Equation 7 to motivate the auto-encoder architecture of Figure 2. The goal of the encoder is to produce sparse codes [17]. We will construct a deep version of the auto-encoder and use it to infer desirable properties of the deep generative model.

A. Deep, constrained, recurrent, sparse auto-encoders

For simplicity, let us consider the case of two linear transformations $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$. $\mathbf{A}^{(1)}$ is applied to a sparse code $\mathbf{x}$, and $\mathbf{A}^{(2)}$ to its output, to generate an observation $\mathbf{y}$. Applied to $\mathbf{y}$, the goal of the ISTA($\mathbf{A}^{(2)}$) encoder is to produce sparse codes. One scenario in which this will happen is if $\mathbf{A}^{(1)}\mathbf{x}$, i.e., the image of $\mathbf{A}^{(1)}$, is sparse or approximately sparse. Under this scenario, the ISTA($\mathbf{A}^{(1)}$) encoder would then use the sparse output of the ISTA($\mathbf{A}^{(2)}$) encoder to produce an estimate $\hat{\mathbf{x}}$ of $\mathbf{x}$, completing the steps of the encoder. The decoder then approximates $\mathbf{y}$ with $\hat{\mathbf{y}} = \mathbf{A}^{(2)}\mathbf{A}^{(1)}\hat{\mathbf{x}}$, leading to a two-layer generalization of the architecture from Figure 1. For the composition of more than two transformations, the requirement that the encoders applied in cascade produce sparse codes suggests that, starting with a sparse code, the output of each of the transformations, except for the very last, which gives the observations, must be approximately sparse. As pointed out by the authors in [21], this is a sufficient but not a necessary condition.
Different from our synthesis perspective, the authors take an analysis perspective that relies on the notion of cosparsity [22]. We do not consider the cosparse setting. We are now in a position to specify our deep sparse generative model, along with the assumptions that accompany it.

B. Deep sparse generative model and coding

Let $\mathbf{Y}$ be the $d_L \times n$ real-valued matrix obtained by applying the composition of $L$ linear transformations $\{\mathbf{A}^{(\ell)}\}_{\ell=1}^L$ to a matrix $\mathbf{X}$ of sparse codes:
$$\mathbf{Y} = \mathbf{A}^{(L)} \cdots \mathbf{A}^{(2)}\mathbf{A}^{(1)}\mathbf{X}, \quad (10)$$
where $\mathbf{X} \in \mathbb{R}^{r_1 \times n}$, $\mathbf{A}^{(\ell)} \in \mathbb{R}^{d_\ell \times r_\ell}$, and, for all $\ell = 1, \ldots, L-1$, all columns of $\mathbf{A}^{(\ell)}$ are $s^{(\ell)}$-sparse with uniformly bounded nonzero entries. If we further assume that each column of $\mathbf{X}$ is $s$-sparse, i.e., at most $s$ of the entries of each column are nonzero, the image of each of the successive transformations $\{\mathbf{A}^{(\ell)}\}_{\ell=1}^{L-1}$ will also be sparse. Finally, we apply the transformation $\mathbf{A}^{(L)}$ to obtain the observations $\mathbf{Y}$. We let $\mathbf{A}^{(0)} = \mathbf{X}$ and $s = s^{(0)}$. Given $\mathbf{Y}$, we would like to solve the following problem:
$$\min_{\mathbf{X}, \{\mathbf{A}^{(\ell)}\}_{\ell=1}^L} \|\mathbf{x}_i\|_1, \ \forall i = 1, \ldots, n \quad \text{s.t.} \quad \mathbf{Y} = \mathbf{A}^{(L)} \cdots \mathbf{A}^{(2)}\mathbf{A}^{(1)}\mathbf{X}, \ \|\mathbf{a}^{(\ell)}_j\|_2 = 1, \ \forall j = 1, \ldots, r_\ell; \ \forall \ell = 1, \ldots, L. \quad (11)$$

1) Connection with shallow dictionary learning: If $\mathbf{A}^{(2)} = \mathbf{A}^{(3)} = \cdots = \mathbf{A}^{(L)} = \mathbf{I}$, the deep sparse generative model of Equation 10 becomes the shallow sparse generative model of Equation 7. Therefore, the deep dictionary-learning problem of Equation 11 reduces to the shallow dictionary-learning problem of Equation 8, a problem that is well studied in the dictionary-learning literature [7], and for which the authors of [7] propose an alternating-minimization procedure, Algorithm 1. In [7], the authors show that, under certain assumptions, the output $\mathbf{A}(T)$ of Algorithm 1 converges to $\mathbf{A}$.
The three key assumptions for convergence are that $\mathbf{A}$ satisfies the restricted isometry property (RIP) [14], that the initialization $\mathbf{A}(0)$ is close to $\mathbf{A}$ (in a precise sense), and that $\{\epsilon_t\}$ forms a decreasing sequence.

2) Reduction to a sequence of shallow problems: In the next section, we introduce two algorithms to learn the dictionaries $\{\mathbf{A}^{(\ell)}\}_{\ell=1}^L$ given $\mathbf{Y}$, by sequential application of Algorithm 1. We call the first algorithm the "forward-factorization" algorithm because it learns the dictionaries in a forward sequence, i.e., beginning with $\mathbf{A}^{(1)}$ and ending with $\mathbf{A}^{(L)}$. The second algorithm, termed the "backward-factorization" algorithm, learns the dictionaries in a backward sequence. In what follows, it will be useful to define two sets of matrices whose properties will be of interest in our theoretical analyses of the algorithms we introduce for deep dictionary learning. The first set of matrices comprises
$$\mathbf{A}^{(\bar{\ell} \to L)} = \prod_{\ell=\bar{\ell}}^{L} \mathbf{A}^{(\ell)}, \quad \forall \bar{\ell} = 1, \ldots, L. \quad (12)$$
Given a layer index $\bar{\ell} = 1, \ldots, L$, $\mathbf{A}^{(\bar{\ell} \to L)}$ is the product of all dictionaries from layer $\bar{\ell}$ to layer $L$. In this notation, $\mathbf{A}^{(L \to L)} = \mathbf{A}^{(L)}$. The second set of matrices comprises
$$\mathbf{Y}^{(\ell)} = \prod_{\ell'=1}^{\ell} \mathbf{A}^{(\ell')}\,\mathbf{X}, \quad \ell = 1, \ldots, L-1. \quad (13)$$
Given a layer index $\ell = 1, \ldots, L-1$, $\mathbf{Y}^{(\ell)}$ is the output of the $\ell$th operator in Equation 10. At depth $\ell$, the columns of $\mathbf{Y}^{(\ell)}$, $\ell = 1, \ldots, L-1$, are sparse representations of the signal $\mathbf{Y}$, i.e., they are deeply sparse. Each column of $\mathbf{Y}^{(\ell)}$ represents the hidden units/neurons at layer $\ell$ for the corresponding example, and its nonzero entries the active units/neurons. We let $\mathbf{Y}^{(0)} = \mathbf{X}$.

C. Learning the deep sparse coding model by sequential alternating minimization

1) Forward-factorization algorithm: Algorithm 2 specifies the steps of the forward-factorization algorithm. The procedure relies on the sequential application of Algorithm 1. To gain some intuition, let us consider the case of $L = 3$.
We first learn the product $\mathbf{A}^{(1 \to 3)} = \mathbf{A}^{(3)}\mathbf{A}^{(2)}\mathbf{A}^{(1)}$. Having learned this product, we then use it to learn $\mathbf{A}^{(2 \to 3)} = \mathbf{A}^{(3)}\mathbf{A}^{(2)}$, which automatically yields $\mathbf{A}^{(1)}$. Finally, we use $\mathbf{A}^{(3)}\mathbf{A}^{(2)}$ to learn $\mathbf{A}^{(2)}$ and $\mathbf{A}^{(3)}$. The sequence in which the forward-factorization algorithm learns the dictionaries is similar in spirit to the multi-layer sparse approximation algorithm (Algorithm 1) from [23]. For the dictionary-update step, [23] uses a projected gradient algorithm to enforce the constraints on the columns of the dictionaries, while we use least squares followed by normalization.

Algorithm 2: Forward-factorization algorithm for deep dictionary learning
Input: Samples $\mathbf{Y}$, number of levels $1 \leq \bar{\ell} \leq L$, initial dictionary estimates $\{\mathbf{A}^{(\ell \to L)}(0)\}_{\ell=1}^{\bar{\ell}}$, accuracy sequences $\{\epsilon^{(\ell \to L)}_t\}_{\ell=1}^{\bar{\ell}}$, sparsity levels $\{s^{(\ell-1)}\}_{\ell=1}^{\bar{\ell}}$.
1 $\hat{\mathbf{A}}^{(0 \to L)} = \mathbf{Y}$
2 $(\hat{\mathbf{A}}^{(1 \to L)}, \hat{\mathbf{X}}) = \text{AltMinDict}(\hat{\mathbf{A}}^{(0 \to L)}, \mathbf{A}^{(1 \to L)}(0), \epsilon^{(1 \to L)}_t, s, \infty)$
3 $\ell = 2$
4 while $\ell \leq \bar{\ell}$ do
5   $(\hat{\mathbf{A}}^{(\ell \to L)}, \hat{\mathbf{A}}^{(\ell-1)}) = \text{AltMinDict}(\sqrt{s^{(\ell-1)}}\,\hat{\mathbf{A}}^{(\ell-1 \to L)}, \mathbf{A}^{(\ell \to L)}(0), \epsilon^{(\ell \to L)}_t, s^{(\ell-1)}, \infty)$
6   $\ell = \ell + 1$
Output: $\{(\hat{\mathbf{A}}^{(\ell \to L)}, \hat{\mathbf{A}}^{(\ell-1)})\}_{\ell=1}^{\bar{\ell}}$

We now state explicitly the assumptions on the deep generative model of Equation 10 that will guarantee the success of Algorithm 2 for arbitrary $L$. The reader can compare these assumptions to assumptions A1–A7 from [7].

(A1) Product Dictionary Matrices satisfying RIP: For each $\ell = 1, \ldots, L$, the product dictionary matrix $\mathbf{A}^{(\ell \to L)}$ has $2s^{(\ell-1)}$-RIP constant $\delta_{2s^{(\ell-1)}} < 0.1$.

(A2) Spectral Condition of Product Dictionary Matrices: For each $\ell = 1, \ldots, L$, the dictionary matrix $\mathbf{A}^{(\ell \to L)}$ has bounded spectral norm, i.e., for some constant $\mu^{(\ell \to L)} > 0$, $\|\mathbf{A}^{(\ell \to L)}\|_2 < \mu^{(\ell \to L)}\sqrt{r_\ell / d_L}$.
(A3a) Non-zero Entries of $\mathbf{X}$: The non-zero entries of $\mathbf{X}$ are drawn i.i.d. from a distribution such that $\mathbb{E}[(X_{ij})^2] = 1$, and satisfy $|X_{ij}| \leq M^{(0)}$ a.s. for all $i, j$.

(A3b) Non-zero Entries of $\{\mathbf{A}^{(\ell)}\}_{\ell=1}^{L-1}$: For all $\ell = 1, \ldots, L-1$, the non-zero entries of $\mathbf{A}^{(\ell)}$ are drawn i.i.d. from a zero-mean distribution such that $\mathbb{E}[(A^{(\ell)}_{ij})^2] = \frac{1}{s^{(\ell)}}$, and satisfy $|\sqrt{s^{(\ell)}}A^{(\ell)}_{ij}| \leq M^{(\ell)}$ a.s. for all $i, j$.

(A4) Sparse Coefficient Matrix: Recalling that $\mathbf{X} = \mathbf{A}^{(0)}$, for all $\ell = 0, \ldots, L-1$, the columns of $\mathbf{A}^{(\ell)}$ have $s^{(\ell)}$ non-zero entries, selected uniformly at random from the set of all $s^{(\ell)}$-sized subsets of $\{1, \ldots, r_{\ell+1}\}$. We further require that, for all $\ell = 0, \ldots, L-1$,
$$s^{(\ell)} \leq \frac{d_L^{1/6}}{c_2\,(\mu^{(\ell+1 \to L)})^{1/3}}.$$

(A5a) Sample Complexity: Given the failure parameter $\delta^{(1 \to L)} > 0$, the number of samples $n$ needs to satisfy
$$n = O\left(\max\left(r_1^2, r_1 M_{(0)}^2 s\right) \log\frac{2r_1}{\delta^{(1 \to L)}}\right). \quad (14)$$

(A5b) Scaling Law for Number of Hidden Units: Given the failure parameters $\{\delta^{(\ell \to L)}\}_{\ell=2}^L > 0$, the size of the hidden layers $r_\ell$ needs to satisfy, for all $\ell = 1, \ldots, L-1$,
$$r_\ell = O\left(\max\left(r_{\ell+1}^2, r_{\ell+1} M_{(\ell)}^2 s^{(\ell)}\right) \log\frac{2r_{\ell+1}}{\delta^{(\ell+1 \to L)}}\right). \quad (15)$$

(A6) Initial Dictionary with Guaranteed Error Bound: It is assumed that, for all $\ell = 1, \ldots, L$, we have access to an initial dictionary estimate $\mathbf{A}^{(\ell \to L)}(0)$ such that
$$\max_i \min_{z \in \{-1, 1\}} \left\|z\,\mathbf{A}^{(\ell \to L)}_i(0) - \mathbf{A}^{(\ell \to L)}_i\right\|_2 \leq \frac{1}{2592\,(s^{(\ell-1)})^2}. \quad (16)$$

(A7) Choice of Parameters for Alternating Minimization: For all $\ell = 1, \ldots, L$, line 5 of Algorithm 2 uses a sequence of accuracy parameters $\epsilon^{(\ell \to L)}_0 = \frac{1}{2592\,(s^{(\ell-1)})^2}$ and
$$\epsilon^{(\ell \to L)}_{t+1} = \frac{25050\,\mu^{(\ell \to L)}\,(s^{(\ell-1)})^3}{\sqrt{d_L}}\,\epsilon^{(\ell \to L)}_t. \quad (17)$$

Comparing these assumptions to A1–A7 from [7], the least trivial one is assumption A1, namely that the product matrices $\{\mathbf{A}^{(\ell \to L)}\}_{\ell=1}^L$ satisfy RIP [14]. The RIP was introduced for single matrices, not products of matrices. One of our contributions is to show that if the sparse matrices $\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(L-1)}$, as well as the (potentially) dense matrix $\mathbf{A}^{(L)}$, satisfy RIP of some order, then the product matrices will satisfy RIP. Formally:

Theorem 1 (RIP-like property of $\mathbf{A}^{(\bar{\ell} \to L)}$): Suppose that, for each $\ell = 1, \ldots, L$, the dictionary matrix $\mathbf{A}^{(\ell)}$ satisfies RIP of order $2s_{Y^{(\ell-1)}}$ with constant $\delta_{2s_{Y^{(\ell-1)}}}$. Then, for all $\bar{\ell} = 1, \ldots, L$ and all $2s_{Y^{(\bar{\ell}-1)}}$-sparse $\mathbf{y}^{(\bar{\ell}-1)}$,
$$\prod_{\ell=\bar{\ell}}^L \left(1 - \delta_{2s_{Y^{(\ell-1)}}}\right)\left\|\mathbf{y}^{(\bar{\ell}-1)}\right\|_2^2 \leq \left\|\mathbf{A}^{(\bar{\ell} \to L)}\mathbf{y}^{(\bar{\ell}-1)}\right\|_2^2 \leq \prod_{\ell=\bar{\ell}}^L \left(1 + \delta_{2s_{Y^{(\ell-1)}}}\right)\left\|\mathbf{y}^{(\bar{\ell}-1)}\right\|_2^2. \quad (18)$$
We prove the theorem, and make further remarks, in the appendix. The theorem assumes that there exist matrices whose columns have a fixed sparsity level, and hence dependent entries, that satisfy RIP (assumption A1). In Section V, we show that, with high probability, sparse random matrices that obey assumptions A3b and A4 satisfy RIP. We will appeal to standard results from random matrix theory [15]. From the analysis in Section V, assumption A2 follows from the concentration of the singular values of such matrices.

We now state our main result on the ability of the forward-factorization algorithm, Algorithm 2, to recover the sequence of dictionaries from the deep sparse generative model.

Theorem 2 (Exact recovery of the deep generative model by Algorithm 2): Let $E_{\ell \to L}$ denote the event $\{\hat{\mathbf{A}}^{(\ell \to L)} = \mathbf{A}^{(\ell \to L)}\}$, $\ell = 1, \ldots, L$. Let $1 \leq \bar{\ell} \leq L$. Then, for all $\bar{\ell}$,
$$P\left[\cap_{\ell=1}^{\bar{\ell}} E_{\ell \to L}\right] \geq \prod_{\ell=1}^{\bar{\ell}} \left(1 - 2\delta^{(\ell \to L)}\right). \quad (19)$$
The proof of the theorem is in the appendix.
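The near-isometry of Theorem 1 can be visualized with a small Monte-Carlo experiment. The sketch below uses $L = 2$, illustrative sizes, and makes no attempt to compute actual RIP constants (which is intractable); it simply measures the ratio $\|\mathbf{A}^{(1 \to L)}\mathbf{y}\|_2^2 / \|\mathbf{y}\|_2^2$ over random sparse vectors and shows it concentrating near 1, as Equation 18 predicts when the RIP constants are small:

```python
import numpy as np

rng = np.random.default_rng(4)

def sparse_columns(rows, cols, s, rng):
    """Column-sparse dictionary in the style of assumptions A3b/A4."""
    A = np.zeros((rows, cols))
    for j in range(cols):
        idx = rng.choice(rows, size=s, replace=False)
        A[idx, j] = rng.standard_normal(s) / np.sqrt(s)
    return A

r1, r2, d_L, s = 400, 200, 100, 5
A1 = sparse_columns(r2, r1, s, rng)                 # sparse dictionary
A2 = rng.standard_normal((d_L, r2)) / np.sqrt(d_L)  # dense last dictionary
P = A2 @ A1                                         # product A^(1 -> L)

ratios = []
for _ in range(500):
    y = np.zeros(r1)
    y[rng.choice(r1, size=2 * s, replace=False)] = rng.standard_normal(2 * s)
    ratios.append(np.linalg.norm(P @ y) ** 2 / np.linalg.norm(y) ** 2)
ratios = np.array(ratios)
print("ratio range:", ratios.min(), ratios.max())
```

With the chosen variances, each factor preserves squared norm in expectation, so the empirical ratios cluster around 1; the spread reflects the product of the two layers' isometry constants.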
If we apply the theorem with $\bar\ell = L$, it states that, with the given probability, we can learn all of the product matrices in the deep sparse generative model, and hence all $L$ dictionaries. Assumption A5a is a statement about the complexity of this learning, i.e., the amount of data required: the sample complexity is $O\left( \max(r_1^2,\, r_1 M_{(0)}^2 s) \log \frac{2 r_1}{\delta_{(1\to L)}} \right)$. In neural-network terminology, $r_1$ is the dimension of the deepest layer's output, i.e., the number of hidden units in that layer. The sparsity $s$ represents the (target) number of active units at that layer. Assumption A5b states that a sufficient condition for the forward-factorization algorithm to succeed is that $r_\ell = O\left( \max(r_{\ell+1}^2,\, r_{\ell+1} M_{(\ell)}^2 s_{(\ell)}) \log \frac{2 r_{\ell+1}}{\delta_{(\ell+1\to L)}} \right)$. In other words, the number of hidden units at a given layer ought to be a function of the number of hidden units at the preceding layer. Together, we can interpret the assumptions of the deep sparse generative model, along with assumptions A5a and A5b, as prescriptions for how to design deep versions of the auto-encoders from Section III. We can summarize these prescriptions as follows:
1) The user first picks the number of hidden units $r_L$ in the first layer. This layer can have more units than the dimension $d_L$ of the data. The matrix of weights connecting the data to the hidden units of the first layer can be dense.
2) Following the first layer, the network ought to expand. That is, the user ought to progressively increase the number of hidden units at all following layers according to assumption A5b. Moreover, the matrices of weights connecting hidden units in consecutive pairs of layers ought to be sparse.
3) Finally, according to assumption A5a, the user ought to use training data of size proportional to the number of hidden units in the deepest layer.
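The width prescription of assumption A5b can be sketched as a simple recursion. The helper below is hypothetical (not from the paper): it takes the $O(\cdot)$ constant to be 1 and fixes a single bound $M$ and failure parameter $\delta$, then grows widths from the data-side layer $r_L$ toward the deepest layer $r_1$.

```python
import math

def design_widths(r_L, sparsities, M=1.0, delta=0.1):
    """Hypothetical sketch of assumption A5b with the O(.) constant taken as 1.

    r_L        -- number of hidden units in the first (data-side) layer
    sparsities -- per-layer sparsity levels s_(ell), ordered from layer L-1 down to 1
    Returns widths [r_L, r_{L-1}, ..., r_1] via
    r_ell ~ max(r_{ell+1}^2, r_{ell+1} * M^2 * s_(ell)) * log(2 r_{ell+1} / delta).
    """
    widths = [r_L]
    for s_ell in sparsities:
        r_next = widths[-1]
        r_ell = max(r_next ** 2, r_next * M ** 2 * s_ell) * math.log(2 * r_next / delta)
        widths.append(math.ceil(r_ell))
    return widths

print(design_widths(r_L=16, sparsities=[3, 3]))
```

Because of the quadratic term, the widths produced by this literal reading of the bound grow very quickly; this reflects the worst-case nature of the analysis rather than a practical architecture recipe.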
Similar to [7], our simulations (Section VI) suggest that the quadratic terms in assumptions A5a and A5b are artifacts of the techniques from [7] for proving the convergence of Algorithm 1 in the shallow case. Specifically, the second-order terms come from the proof of Lemma A.6 in [7]. That is, the simulations suggest that the learning complexity depends on the product of the number of active neurons in the deepest layer and the number of hidden units in that layer. Moreover, after picking the number of hidden units in the first layer, the simulations suggest that the number of hidden units at a given layer ought to increase linearly with both the target number of active units in that layer (sparsity) and the number of hidden units in its input layer. An interesting line of research would be to improve the bounds from Lemma A.6 in [7].

Assumption A7 states that the accuracy parameter of the sparse-coding step ought to decrease as a function of iterations. This is natural: if the estimates of the dictionaries indeed converge to the true dictionaries, then so should the error of the sparse-coding step, which comes from using the "wrong" dictionary. Finally, assumption A6 states how close to the true dictionaries we should initialize, as a function of the model parameters.

2) Backward-factorization algorithm: Algorithm 3 specifies the steps of the backward-factorization algorithm. Let us gain some intuition by considering the case $L = 2$. First, we learn $A^{(2)}$ and $Y^{(1)}$ using Algorithm 1 which, under regularity assumptions, guarantees that, with high probability, $(\hat A^{(2)}, \hat Y^{(1)}) = (A^{(2)}, Y^{(1)})$. Then, we solve for $A^{(1)}$ and $X$ similarly. Appealing to Theorem 3.1 from [7], we can conclude that, with high probability, we have solved for $A^{(2)}$, $A^{(1)}$, and $X$. The second step assumes that the sparse matrix $A^{(1)}$ satisfies the RIP; we show this in Section V.
Algorithm 3: Backward-factorization algorithm for deep dictionary learning
Input: samples $Y$; number of levels $1 \le \bar\ell \le L$; initial dictionary estimates $\{A^{(\ell)}(0)\}_{\ell=\bar\ell}^{L}$; accuracy sequences $\{\epsilon^{(\ell)}_t\}_{\ell=\bar\ell}^{L}$; sparsity levels $\{s_{Y^{(\ell-1)}}\}_{\ell=\bar\ell}^{L}$; scaling sequence $\{\sigma^{(0\to\ell-1)}\}_{\ell=\bar\ell}^{L}$.
1: $\hat Y^{(L)} = Y$
2: $\ell = L$
3: while $\ell \ge \bar\ell$ do
4: $\quad (\hat A^{(\ell)}, \hat Y^{(\ell-1)}) = \mathrm{AltMinDict}(\sigma^{(0\to\ell-1)} \hat Y^{(\ell)},\, A^{(\ell)}(0),\, \epsilon^{(\ell)}_t,\, s_{Y^{(\ell-1)}},\, \infty)$
5: $\quad \ell = \ell - 1$
Output: $\{\hat A^{(\ell)}, \hat Y^{(\ell-1)}\}_{\ell=\bar\ell}^{L}$

We now state assumptions on the deep generative model of Equation 10 guaranteeing that Algorithm 3 can successfully recover all the dictionaries.

a) Assumptions: Let $Y^{(0)} = X$, $s_{Y^{(0)}} = s$, and $s_{Y^{(\ell)}} = s \prod_{\ell'=1}^{\ell} s_{(\ell')}$, $\ell = 1, \dots, L-1$. Further define the scalar sequence $\{\sigma^{(0\to\ell)}\}_{\ell=0}^{L-1}$ as follows:
$$\sigma^{(0\to 0)} = 1, \quad (20)$$
$$\sigma^{(0\to\ell)} = \prod_{\ell'=1}^{\ell} \sqrt{s_{(\ell')}}, \quad \ell = 1, \dots, L-1. \quad (21)$$
For each $\ell = 1, \dots, L-1$, the purpose of $\sigma^{(0\to\ell)}$ is to scale the sparse matrix $Y^{(\ell)}$ so that the variance of its non-zero entries is centered around 1 (we detail this in the appendix, where we sketch the proof of convergence of the algorithm).

(B1) Dictionary matrices satisfying the RIP: For each $\ell = 1, \dots, L$, the dictionary matrix $A^{(\ell)}$ has $2 s_{Y^{(\ell-1)}}$-RIP constant $\delta_{2 s_{Y^{(\ell-1)}}} < 0.1$.
(B2a) Spectral condition of all dictionaries: For each $\ell = 1, \dots, L$, the dictionary matrix $A^{(\ell)}$ has bounded spectral norm, i.e., for some constant $\mu_{(\ell)} > 0$, $\left\|A^{(\ell)}\right\|_2 < \mu_{(\ell)} \sqrt{\frac{r_\ell}{d_\ell}}$.
(B2b) Spectral condition of sparse dictionaries: For each $\ell = 1, \dots, L-1$, the transpose of the dictionary matrix $A^{(\ell)}$ has bounded smallest singular value, i.e., for some constant $\tilde\mu_{(\ell)} > 0$, $\sigma_{\min}(A^{(\ell)T}) > \tilde\mu_{(\ell)} \sqrt{\frac{r_\ell}{d_\ell}}$.
(B2c) Spectral condition of indicator matrices of sparse dictionaries: For each $\ell = 1, \dots, L-1$, the spectral norm of the matrix $U^{(\ell)}$ of indicator values of the dictionary matrix $A^{(\ell)}$ satisfies $\left\|U^{(\ell)}\right\|_2 \le 2 \sqrt{\frac{s_{(\ell)}^2 r_\ell}{r_{\ell+1}}}$.
(B3) Non-zero entries of the coefficient matrix: The non-zero entries of $X$ are drawn i.i.d. from a zero-mean distribution such that $\mathbb{E}[(X_{ij})^2] = 1$, and satisfy $|X_{ij}| \le M_{(0)}$ a.s., $\forall i, j$.
(B4) Sparse coefficient matrix: The columns of the coefficient matrix have $s$ non-zero entries, selected uniformly at random from the set of all $s$-sized subsets of $\{1, \dots, r_1\}$. It is required that $s \le \frac{d_1^{1/6}}{c_2\, \mu_{(1)}^{1/3}}$, for some universal constant $c_2$. We further require that, $\forall \ell = 1, \dots, L-1$, $s_{Y^{(\ell)}} \le \frac{d_{\ell+1}^{1/6}}{c_2\, \mu_{(\ell+1)}^{1/3}}$.
(B5) Sample complexity: Given failure parameters $\{\delta_\ell\}_{\ell=1}^{L} > 0$, the number of samples $n$ needs to be
$$n = O\left( \max_\ell \max\left( \frac{r_1 r_\ell\, s_{Y^{(\ell-1)}}}{s} \log \frac{2 r_1}{\delta_\ell},\;\; r_\ell M_{Y^{(\ell-1)}}^2 s_{Y^{(\ell-1)}} \log \frac{2 r_\ell}{\delta_\ell} \right) \right). \quad (22)$$
Here, $|\sigma^{(0\to\ell)} Y^{(\ell)}_{ij}| \le M_{Y^{(\ell-1)}}$, $\ell = 1, \dots, L-1$, and $|Y^{(0)}_{ij}| = |X_{ij}| \le M_{(0)}$.
(B6) Initial dictionary with guaranteed error bound: It is assumed that, $\forall \ell = 1, \dots, L$, we have access to an initial dictionary estimate $A^{(\ell)}(0)$ such that
$$\max_{i \in \{1,\dots,r_\ell\}}\; \min_{z \in \{-1,1\}} \left\| z\, A^{(\ell)}_i(0) - A^{(\ell)}_i \right\|_2 \le \frac{1}{2592\, s_{Y^{(\ell-1)}}^2}. \quad (23)$$
(B7) Choice of parameters for alternating minimization: For all $\ell = 1, \dots, L$, line 4 of Algorithm 3 uses a sequence of accuracy parameters $\epsilon^{(\ell)}_0 = O\left(\frac{1}{2592\, s_{Y^{(\ell-1)}}^2}\right)$ and
$$\epsilon^{(\ell)}_{t+1} = \frac{25050\, \mu_\ell\, s_{Y^{(\ell-1)}}^3}{\sqrt{d_\ell}}\, \epsilon^{(\ell)}_t. \quad (24)$$

We are now in a position to state a theorem on the ability of the backward-factorization algorithm to learn the deep generative model of Equation 10, i.e., to recover $\{A^{(\ell)}\}_{\ell=1}^{L}$ under assumptions B1–B7.
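The scaling sequence of Equations 20–21 can be sketched as a running product. The helper below is a hypothetical illustration, reading the product in Equation 21 as running over $\ell' = 1, \dots, \ell$ so that it is consistent with the base case $\sigma^{(0\to 0)} = 1$.

```python
import math

def sigma_sequence(sparsities):
    """Hypothetical sketch of Equations 20-21.

    sparsities = [s_(0), s_(1), ..., s_(L-1)], with s_(0) = s.
    Returns [sigma^(0->0), sigma^(0->1), ..., sigma^(0->L-1)].
    """
    sigmas = [1.0]  # sigma^(0->0) = 1, Equation 20
    for ell in range(1, len(sparsities)):
        # Equation 21: multiply in sqrt(s_(ell)) at each level
        sigmas.append(sigmas[-1] * math.sqrt(sparsities[ell]))
    return sigmas

print(sigma_sequence([3, 3, 4]))
```

Before each dictionary-learning step of Algorithm 3 (line 4), $\hat Y^{(\ell)}$ is multiplied by the corresponding $\sigma^{(0\to\ell-1)}$ so that its non-zero entries have variance near 1, matching the normalization assumed by AltMinDict.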
Theorem 3 (Exact recovery of the deep generative model by Algorithm 3): Let us denote by $E_\ell$ the event $\{\hat A^{(\ell)} = A^{(\ell)}\}$, $\ell = 1, \dots, L$. Let $1 \le \bar\ell \le L$. Then, $\forall \bar\ell$,
$$\mathbb{P}\left[ \cap_{\ell=\bar\ell}^{L} E_\ell \right] \ge \prod_{\ell=\bar\ell}^{L} (1 - 2\delta_\ell). \quad (25)$$

The theorem states that, with the given probability, we can learn all $L$ transformations in the deep sparse generative model. Random matrices satisfy assumptions B1, B2a, and B2b with high probability. This is well known for dense matrices; Section V shows it for a class of sparse random matrices with fixed column sparsity. Assumption B2c also holds with high probability for these matrices (Lemma A.1 from [7]). Compared to the proof of Theorem 2, the proof of Theorem 3 is less immediate. We give a sketch of the proof in the appendix.

Comparing the complexities of Algorithms 2 and 3: In Equation 22, the term corresponding to $\ell = 1$ is the same as Equation 14. Therefore, in the absence of the maximum over layers in Equation 22, Algorithms 2 and 3 would have the same complexity. Because of the maximum over layers, however, Algorithm 3 has higher complexity than Algorithm 2, i.e., it requires more data. Intuitively, this is because assumptions B1–B7 put fewer constraints on the deep generative model than assumptions A1–A7 do. Indeed, assumption A5b is a stronger constraint on the relationship between the number of hidden units in successive layers than assumption B2b, which only requires that, for all $\ell = 1, \dots, L-1$, $r_\ell \ge r_{\ell+1}$. In other words, compared to Theorem 2, Theorem 3 applies to a larger set of deep generative models.

Some limitations of Theorems 2 and 3: One limitation of the theorems is that some of the assumptions require the matrices from Equation 10 to have very large dimensions. In other words, for small (order 1000) values of the dimensions, some of the assumptions cannot necessarily be met.
In assumption A1, for instance, the larger $L$, the more stringent the conditions on the RIP constants of the individual dictionaries that would guarantee that the product dictionaries satisfy the assumption. This also applies to assumption B1, in which the sparsity of $y^{(\ell)}$ decreases with increasing $\ell$ and $L$, making it harder to satisfy. Another limitation is assumption B4 on $s_{Y^{(\ell)}}$, which cannot easily be satisfied in small dimensions. Assumptions A6 and B6, which require the initial dictionaries to be close to the true ones, are also limiting. However, close initialization is a standard assumption in convergence analyses of dictionary-learning algorithms. Finally, Theorem 2 assumes that the sparse dictionaries are random, which may be limiting. Similar to [24], [25], which take a Bayesian approach to dictionary learning, we can think of this assumption as a prior on the sparse dictionaries.

Conjecture on a lower bound on complexity: Our analysis of Algorithms 2 and 3 gives upper bounds on the number of examples sufficient to learn the dictionaries of the deep sparse-coding model. Here, we use a so-called "coupon-collection" argument to conjecture a lower bound. The coupon-collector's problem [13] is a classical one that asks: given $r$ coupons, how many draws $n$ (with replacement) does one need, in expectation, in order to have drawn each coupon at least once? In the context of the shallow sparse-coding model (Equation 7; Equation 10 with $L = 1$), the $r$ columns of the dictionary play the role of the coupons, and the number of draws $n$ gives a lower bound on the number of examples required for learning: for successful dictionary learning, we need to observe each column of the dictionary at least once.
In [26], the authors use a generalization of the coupon-collector's problem [13], in which one selects $s$ coupons without replacement at every draw, to argue that a lower bound on the number of examples required for dictionary learning is $O(r s \log r)$. The authors proceed to prove this lower bound for the case of a square dictionary. To our knowledge, there is no similar analysis for the case of an overcomplete dictionary and noiseless observations. While there exist analyses for the noisy case based on information-theoretic arguments [27], the accompanying results are not meaningful in the noiseless case, for which the SNR is infinite.

Using a coupon-collection argument, let us start with a lower bound for the backward-factorization algorithm (Algorithm 3). In the end, the lower bound ought to be independent of the sequence in which we perform the factorization and, in fact, we argue that it is. The lower bound for this algorithm is $O(\max_\ell r_\ell\, s_{Y^{(\ell-1)}} \log r_\ell)$. Since $r_1 \ge r_2 \ge \cdots \ge r_L$, and $s = s_{Y^{(0)}} \le s_{Y^{(1)}} \le \cdots \le s_{Y^{(L-1)}}$, the lower bound is $O(r_1 s \log r_1)$. For the forward-factorization algorithm (Algorithm 2), the first step of the algorithm, which yields the product matrix, dictates the sample complexity $n$. The coupon-collection argument yields a lower bound of $O(r_1 s \log r_1)$. It is not hard to see that the sequence in which one obtains the dictionaries does not matter for the lower bound. This is because, ultimately, any factorization must obtain, as part of its steps, the matrix of sparse codes $X$, and the coupon-collection argument suggests a lower bound of $O(r_1 s \log r_1)$ to obtain this matrix. The effect of the sequence in which one factors the product dictionary is simply to constrain the relationship between the number of hidden units at different layers.
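The coverage part of the coupon-collection argument is easy to simulate: each training example activates $s$ of the $r_1$ dictionary columns, and we count draws until every column has been seen at least once. The sketch below uses the paper's simulation values $r_1 = 800$, $s = 3$; note it illustrates only the coverage requirement, roughly $(r/s)\log r$ draws, not the full $O(r s \log r)$ bound of [26].

```python
import math
import random

def draws_until_coverage(r, s, rng):
    """Draws of s distinct coupons (columns) until all r have appeared."""
    seen, draws = set(), 0
    while len(seen) < r:
        seen.update(rng.sample(range(r), s))  # one example activates s columns
        draws += 1
    return draws

rng = random.Random(0)
r, s = 800, 3
trials = [draws_until_coverage(r, s, rng) for _ in range(20)]
avg = sum(trials) / len(trials)

# Coverage alone needs roughly (r/s) * log(r) examples.
print(avg, (r / s) * math.log(r))
```

Any algorithm needs at least $\lceil r/s \rceil$ examples just to see every column once; the logarithmic factor reflects the randomness of which columns each example activates.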
V. CONCENTRATION OF EIGENVALUES OF COLUMN-SPARSE RANDOM MATRICES WITH DEPENDENT SUB-GAUSSIAN ENTRIES

The proofs of our main results, Theorems 2 and 3, rely on sparse matrices satisfying the RIP [14]. In this section, we show that a class of random sparse matrices indeed satisfies the RIP.

A. Sparse random sub-Gaussian matrix model

Let $A \in \mathbb{R}^{d\times r}$ be a matrix with $r$ columns $(a_j)_{j=1}^{r}$. Let $U \in \mathbb{R}^{d\times r}$ be a binary random matrix whose columns $(u_j)_{j=1}^{r} \in \{0,1\}^d$ are i.i.d. $s_A$-sparse binary random vectors, each obtained by selecting $s_A$ of the $d$ entries without replacement and letting $U_{ij}$ be the indicator random variable of whether entry $i = 1, \dots, d$ of column $j$ was selected. Let $V \in \mathbb{R}^{d\times r}$ be a random matrix with i.i.d. entries distributed according to a zero-mean sub-Gaussian random variable $V$ with variance 1, $|V| \le 1$ almost surely, and sub-Gaussian norm $\|V\|_{\psi_2}$; we adopt the notation $\|\cdot\|_{\psi_2}$ from [15] for the sub-Gaussian norm of a random variable. We consider the following generative model for the entries of $A$:
$$A_{ij} = \sqrt{\frac{d}{s_A}}\, U_{ij} V_{ij}, \quad i = 1, \dots, d;\; j = 1, \dots, r. \quad (26)$$

It is not hard to verify that the random matrix $A$ thus obtained is such that $\mathbb{E}[a_j a_j^T] = I$. To see this, we note the following properties of the generative model for $A$:
- $\mathbb{P}[U_{ij} = 1] = \frac{s_A}{d}$.
- For $i \ne i'$, $\mathbb{P}[U_{ij} = 1 \cap U_{i'j} = 1] = \frac{s_A}{d} \cdot \frac{s_A - 1}{d - 1}$.
- $\mathbb{E}[A_{ij}^2] = \mathbb{E}\left[\left(\sqrt{\frac{d}{s_A}}\, U_{ij} V_{ij}\right)^2\right] = \frac{d}{s_A} \cdot \mathbb{P}[U_{ij} = 1] \cdot \mathbb{E}[V_{ij}^2] = 1$.
- For $i \ne i'$, $\mathbb{E}[A_{ij} A_{i'j}] = \frac{d}{s_A}\, \mathbb{E}[U_{ij} U_{i'j} V_{ij} V_{i'j}] = \frac{d}{s_A}\, \mathbb{E}[U_{ij} U_{i'j}]\, \underbrace{\mathbb{E}[V_{ij} V_{i'j}]}_{0} = 0$.
- $\|a_j\|_2^2 \le s_A \cdot \frac{d}{s_A} = d$ a.s., $j = 1, \dots, r$.
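The generative model of Equation 26, and the moment properties listed above, can be checked empirically. The sketch below instantiates $V$ as i.i.d. Rademacher variables (one bounded, zero-mean, variance-1 sub-Gaussian choice consistent with the model; the dimensions are hypothetical) and verifies the exact column-norm identity and the approximate isotropy $\mathbb{E}[a_j a_j^T] \approx I$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, s_A = 40, 10_000, 5   # many columns so empirical moments concentrate

# U: each column selects s_A of the d entries without replacement
U = np.zeros((d, r))
for j in range(r):
    U[rng.choice(d, s_A, replace=False), j] = 1.0
# V: i.i.d. Rademacher (+-1 with equal probability)
V = rng.choice([-1.0, 1.0], size=(d, r))
# Equation 26: A_ij = sqrt(d/s_A) * U_ij * V_ij
A = np.sqrt(d / s_A) * U * V

# ||a_j||_2^2 = s_A * (d/s_A) = d exactly for Rademacher V
assert np.allclose((A ** 2).sum(axis=0), d)

# Empirical second-moment matrix of the columns approximates the identity
second_moment = (A @ A.T) / r
assert np.linalg.norm(second_moment - np.eye(d), 2) < 0.5
```

For Rademacher $V$ the bound $\|a_j\|_2^2 \le d$ is attained with equality; for other bounded sub-Gaussian choices of $V$ it holds as an inequality.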
Ultimately, we would like to understand the concentration behavior of the singular values of 1) $A^T$, and 2) sub-matrices of $A$ consisting of sparse subsets of its columns (RIP-like results). We first recall the following result from non-asymptotic random matrix theory [15], and apply it to obtain a concentration result on the singular values of the matrix $A^T$.

Theorem 4 (Restatement of Theorem 5.39 from [15] (sub-Gaussian rows)): Let $W \in \mathbb{R}^{r\times d}$ be a matrix whose rows $\{(W^T)_j\}_{j=1}^{r}$ (where $(W^T)_j$ is the $j$th column of $W^T$) are independent sub-Gaussian isotropic random vectors in $\mathbb{R}^d$. Then, for every $t \ge 0$, with probability at least $1 - \exp(-ct^2)$, one has
$$\sqrt{r} - C\sqrt{d} - t \;\le\; \sigma_{\min}(W) \;\le\; \sigma_{\max}(W) \;\le\; \sqrt{r} + C\sqrt{d} + t. \quad (27)$$
Here, $C = C_K$ and $c = c_K \ge 0$ depend only on the sub-Gaussian norm $K = \max_j \left\|(W^T)_j\right\|_{\psi_2}$ of the rows.

Before we can apply this result to $W = A^T$, we need to show that the columns of $A$ are sub-Gaussian random vectors, defined as follows.

Definition 5 (Definition 5.22 from [15] (sub-Gaussian random vectors)): We say that a random vector $x$ in $\mathbb{R}^d$ is sub-Gaussian if the one-dimensional marginals $\langle x, z\rangle$ are sub-Gaussian random variables for all $z \in \mathbb{R}^d$. The sub-Gaussian norm of $x$ is defined as
$$\|x\|_{\psi_2} = \sup_{z \in \mathcal{S}^{d-1}} \|\langle x, z\rangle\|_{\psi_2}. \quad (28)$$

Theorem 6 (The columns of $A$ are sub-Gaussian random vectors): For every $j = 1, \dots, r$, $a_j$ is a sub-Gaussian random vector. Moreover,
$$\|a_j\|_{\psi_2}^2 \le \frac{d}{s_A}\, C_1 \|V\|_{\psi_2}^2, \quad (29)$$
where $C_1$ is a universal constant. The proof of the theorem is in the appendix.

B. Concentration properties of the minimum and maximum singular values of $A$

We now have all of the requisites to apply Theorem 4 to the matrix $A$ defined in Equation 26.

Lemma 7: Let $A$ be the sparse random matrix obtained according to Equation 26.
There exist universal constants $c$ and $C$ such that, with probability at least $1 - \exp(-ct^2)$,
$$\sqrt{r}\,(1 - \delta) \;\le\; \sigma_{\min}(A^T) \;\le\; \sigma_{\max}(A^T) \;\le\; \sqrt{r}\,(1 + \delta) \quad (30)$$
and
$$\left\| \frac{1}{r} A A^T - I \right\|_2 \le \max(\delta, \delta^2), \quad \delta = C\sqrt{\frac{d}{r}} + \frac{t}{\sqrt{r}}. \quad (31)$$

Proof 1: The first part of the lemma follows from applying Theorem 4 to $A^T$ from Equation 26. The second part follows from Lemma 5.36 in [15], which shows the equivalence of the first and second parts.

C. Near-isometry properties of subsets of columns of $A$

Consider a subset $T_0 \subset \{1, 2, \dots, r\}$ with $|T_0| = s$, and let $A_{T_0}$ denote the $d\times s$ matrix formed by the columns of $A$ from Equation 26 indexed by $T_0$. We want an RIP-like result for $A_{T_0}$, i.e., we seek a bound on the spectral-norm difference between $A_{T_0}^T A_{T_0}$ (appropriately normalized) and the identity matrix.

Definition 8 (Restatement of the definition of restricted isometries from [15]): A $d\times r$ matrix $A$ satisfies the restricted isometry property of order $s \ge 1$ if there exists $\delta_s \ge 0$ such that the inequality
$$(1 - \delta_s)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_s)\|x\|_2^2 \quad (32)$$
holds for all $x \in \mathbb{R}^r$ with $|\mathrm{Supp}(x)| \le s$. The smallest such number $\delta_s = \delta_s(A)$ is called the restricted isometry constant of $A$.

We first recall a result from [15] which, along with the results from the previous section, will give us the desired bound. The result applies to either of two models for $A$. Row-independent model: the rows of $A$ are independent sub-Gaussian isotropic random vectors in $\mathbb{R}^r$. Column-independent model: the columns $a_j$ of $A$ are independent sub-Gaussian isotropic random vectors in $\mathbb{R}^d$ with $\|a_j\|_2 = \sqrt{d}$ a.s. The column-independent model is the one of interest here.

Theorem 9 (Restatement of Theorem 5.65 from [15] (sub-Gaussian restricted isometries)): Let $A \in \mathbb{R}^{d\times r}$ be a sub-Gaussian random matrix with independent rows or columns, following either of the two models above.
Then the normalized matrix $\bar A = \frac{1}{\sqrt{d}} A$ satisfies the following for every sparsity level $1 \le s \le r$ and every number $\delta \in (0, 1)$: if $d \ge C\delta^{-2} s \log(er/s)$, then
$$\delta_s(\bar A) \le \delta \quad (33)$$
with probability at least $1 - 2\exp(-c\delta^2 d)$. Here, $C = C_K$ and $c = c_K \ge 0$ depend only on the sub-Gaussian norm $K = \max_j \|a_j\|_{\psi_2}$ of the rows or columns of $A$.

We can apply this theorem to $A$ from Equation 26 to conclude that, with a sufficient number of measurements $d$, $\bar A$ satisfies the RIP of order $s$.

Lemma 10: Let $A$ be the sparse random matrix obtained according to Equation 26. Then the normalized matrix $\bar A = \frac{1}{\sqrt{d}} A$ satisfies the following for every sparsity level $1 \le s \le r$ and every number $\delta \in (0, 1)$: if $d \ge C\delta^{-2} s \log(er/s)$, then
$$\delta_s(\bar A) \le \delta \quad (34)$$
with probability at least $1 - 2\exp(-c\delta^2 d)$, where $C = C_K$ and $c = c_K \ge 0$ depend only on the bound from Equation 61 on the sub-Gaussian norm $\|a_j\|_{\psi_2}$ of the columns of $A$.

Proof 2: By construction, $A$ follows the column-independent model. In addition, we proved in Theorem 6 that the columns of $A$ are sub-Gaussian random vectors. The result follows from applying Theorem 9 to $A$ from Equation 26.

Remark 4: Note that $\bar A_{ij} = \frac{1}{\sqrt{d}} \cdot \frac{\sqrt{d}}{\sqrt{s_A}}\, U_{ij} V_{ij} = \frac{1}{\sqrt{s_A}}\, U_{ij} V_{ij}$.

VI. SIMULATIONS

We use simulated data to demonstrate the ability of Algorithms 2 and 3 to learn the deep generative model from Equation 10, and to assess their sensitivity to the distance between the initial dictionaries and the true ones. In addition, we compare the algorithms to an auto-encoder architecture designed specifically to learn the deep generative model. To be clear, the simulations are not meant to be exhaustive. Their goal is to demonstrate some aspects of the theoretical results.
a) Simulations: As both algorithms are computationally demanding, requiring the solutions of many convex optimization problems, we consider the case $L = 2$. We used the simulation studies from [7] to guide our choice of parameters.
1) We chose $A^{(2)}$ to be of size $100\times 200$, i.e., $d_2 = 100$ and $r_2 = 200$, with entries i.i.d. $\mathcal{N}(0, \frac{1}{d_2})$.
2) We chose $A^{(1)}$ to be of size $200\times 800$, i.e., $d_1 = 200$ and $r_1 = 800$, with entries drawn according to the sparse random matrix model from Equation 26, letting $\{V_{ij}\}_{i=1,j=1}^{200,800}$ be i.i.d. Rademacher random variables ($\pm 1$ with equal probability). Each column of $A^{(1)}$ has sparsity level $s_{(1)} = 3$.
3) We chose $n = 6400$, so that $X$ is of size $800\times 6400$. Similar to [7], we chose the non-zero entries of $X$ to be i.i.d. $\mathcal{U}([-2, -1] \cup [1, 2])$, and picked $s = 3$ (the sparsity level of each column of $X$).
4) For a given matrix $A$ and its estimate $\hat A$, we use the following error metric to assess the success of the algorithm at learning $A$:
$$\mathrm{err}(\hat A, A) = \max_i \sqrt{1 - \frac{\langle a_i, \hat a_i\rangle^2}{\|a_i\|_2^2\, \|\hat a_i\|_2^2}}. \quad (35)$$

Fig. 3. Semi-log plot of the error between the true dictionaries and the recovered ones as a function of the number of iterations of Algorithm 1. The figure shows that the recovered dictionaries converge to the true ones linearly (exponentially). We refer the reader to the main text for additional interpretations of the simulations.

N.B.: Note that $n$ is much smaller than $r_1^2$ and $r_2^2$. Yet, the simulations will demonstrate that the dictionaries can be learned exactly. As in [7], this highlights the fact that the second-order terms in the complexity bounds are an artifact of the proof techniques from [7], which form the basis for our own proofs.
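The error metric of Equation 35 is the worst-case sine of the angle between corresponding columns of $A$ and $\hat A$; a minimal sketch (not the paper's code) is:

```python
import numpy as np

def dict_err(A_hat, A):
    """Equation 35: max over columns of sqrt(1 - cos^2(angle(a_i, a_i_hat)))."""
    num = np.sum(A * A_hat, axis=0) ** 2                 # <a_i, a_i_hat>^2
    den = np.sum(A ** 2, axis=0) * np.sum(A_hat ** 2, axis=0)
    # clip guards against tiny negative values from floating-point round-off
    return np.max(np.sqrt(np.clip(1.0 - num / den, 0.0, None)))

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 50))
assert dict_err(A, A) < 1e-7           # identical dictionaries -> error ~ 0
assert dict_err(-2.0 * A, A) < 1e-7    # invariant to per-column sign and scale
```

The sign-and-scale invariance is the reason the theory only requires recovery of each column up to a sign flip, matching the $\min_{z\in\{-1,1\}}$ in assumptions A6 and B6.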
b) Initialization: To initialize Algorithms 2 and 3, we need to initialize $A^{(1)}$, $A^{(2)}$, and $A^{(2\to 2)}$. For a generic matrix $A$ of size $d\times r$, following the simulation studies from [7], we let $A(0) = A + Z$, where the entries of $Z$ are i.i.d. $\alpha\cdot\mathcal{N}(0, \frac{1}{d})$, $\alpha \in (0, 1)$. We define the signal-to-noise ratio (SNR) of $A(0)$ as
$$\mathrm{SNR} = 10\log_{10}\left(\frac{\sigma_A^2}{\sigma_Z^2}\right) = -10\log_{10}\alpha^2. \quad (36)$$

c) Implementation: We implemented Algorithms 2 and 3 in the Python programming language. We used cvxpy [28] with the MOSEK [29] solver to solve the optimization problems in the inner loop of the alternating-minimization procedure (Algorithm 1). The authors in [7] use the GradeS [30] algorithm for this inner loop because it is faster; in our experience, cvxpy was much more stable numerically. The inner loop of Algorithm 1 is embarrassingly parallelizable; therefore, we also implemented a distributed version of our algorithm using the Python module dask [31]. The code is hosted on bitbucket. The author is happy to make it available upon request.¹

d) Results: We assessed the ability of Algorithms 2 and 3 to recover their respective dictionaries, as well as the sensitivity of the algorithms to the SNR of the initial dictionaries.

Dictionary learning at moderate SNR: We applied the algorithms to data simulated as described previously. We generated the initial dictionaries so that their SNR equals 6 dB, which corresponds to $\alpha = 0.5$. Figure 3 depicts the error between the true dictionaries and the ones recovered using Algorithms 2 and 3. We obtained $\hat A^{(1\to 2)}$ and $\hat A^{(2\to 2)}$ using Algorithm 2, while $\hat A^{(1)}$ and $\hat A^{(2)}$ were obtained using Algorithm 3. The difference in the number of iterations comes from our choice of terminating at an error on the order of $10^{-5}$.

¹https://bitbucket.org/demba/ds2p/src/master/
Fig. 4. Semi-log plot of the error between the true dictionaries and the recovered ones as a function of the SNR of the initial dictionaries. The figure shows that both algorithms are robust to initialization errors at moderate to high SNRs. We refer the reader to the main text for additional interpretations of the simulations.

The figure demonstrates the linear (exponential) convergence of the recovered dictionaries to the true ones. The error curve for $\hat A^{(2)}$ is consistent with the simulation studies from [7]. In this example ($L = 2$), $A^{(2\to 2)}$ and $A^{(2)}$ are the same matrix. The error rate appears to be faster for the former than for the latter. We hypothesize that this is because the sparsity-level parameter for $A^{(2\to 2)}$ is $s_{(1)} = 3$, while that for $A^{(2)}$ is $s\cdot s_{(1)} = 9$; the amount of data required in Theorem 3.1 from [7] scales linearly with the sparsity level. The error rate for $A^{(1)}$ is much faster than all of the others, i.e., it would appear that $A^{(1)}$ is easier to learn. This should not be surprising: $A^{(1)}$ is a very sparse matrix ($3$-sparse, in fact), so its columns have far fewer than $d_1 = 200$ degrees of freedom. The proof techniques utilized in [7], which we rely upon, do not explicitly take the sparsity of $A^{(1)}$ into account. All the theorem gives is an upper bound on the rate of convergence, without an accompanying lower bound. Studying the effect of the sparsity of the matrix of interest in dictionary learning would be an interesting area of further inquiry. Overall, our simulations demonstrate the ability of Algorithms 2 and 3 to recover the deep generative model from Equation 10. We would like to stress that Figure 3 required on the order of 5 hours to generate on a PC with 8 GB of RAM and 4 cores with 2 threads each.
Sensitivity to the SNR of the initialization: We repeated the simulations for 5 equally spaced values of the SNR of the initial dictionaries, from $-3$ dB to $9$ dB. Figure 4 plots the error between the true dictionaries and the recovered ones as a function of SNR. We ran all algorithms for $T = 10$ iterations. The figure demonstrates that both Algorithms 2 and 3 perform well at moderate to high SNRs. Consistent with the findings from Figure 3, the forward-factorization algorithm converges faster overall and does better at high SNRs. The backward-factorization algorithm appears to be more robust at low SNRs. To estimate $A^{(2)}$, the backward-factorization algorithm uses $n = 6400$ examples, while the forward-factorization algorithm only uses $r_1 = 800$ examples. This would explain why the former performs better at low SNRs. While, in practice, the simulations may or may not satisfy all the assumptions of the theorems, they provide a useful demonstration of the algorithms.

Fig. 5. Auto-encoder architecture for learning the dictionaries in the deep sparse-coding model with $L = 2$. The encoder performs $K$ iterations of the ML-FISTA algorithm. The decoder reconstructs the observations by applying the dictionaries to the output of the encoder.

Comparison with an auto-encoder with shared weights: Recent work [8] suggests that a two-layer auto-encoder whose encoder performs one iteration of ISTA [18], and whose decoder is linear and shares weights with the encoder (Figure 2, $K = 1$), can approximately learn the dictionary in the shallow sparse-coding model. It is reasonable, then, to expect that a multi-layer generalization of this architecture ought to be able to approximately learn the dictionaries of the deep sparse-coding model. Figure 5 proposes such an architecture. The encoder performs multi-layer FISTA [32] (ML-FISTA).
ML-FISTA is a generalization of the multi-layer ISTA [32] (ML-ISTA) algorithm for solving
$$\min_{x\in\mathbb{R}^{r_1}}\; \frac{1}{2}\left\|y - A_2 A_1 x\right\|_2^2 + \lambda_2 \underbrace{\left\|A_1 x\right\|_1}_{\|y^{(1)}\|_1} + \lambda_1 \|x\|_1. \quad (37)$$
ML-ISTA generalizes the ISTA algorithm from shallow sparse coding to deep sparse coding; ML-FISTA is a fast version of ML-ISTA. Similar to the interpretation of ISTA as a recurrent architecture, we can interpret ML-FISTA as a recurrent network with weights $A_1$ and $A_2$ and soft-thresholding (two-sided ReLU) nonlinearities. The encoder of Figure 5 unfolds $K$ iterations of ML-FISTA. The decoder is linear and shares weights with the encoder: it first applies $A_1$ and then $A_2$ to the output of the encoder. We train the auto-encoder by mini-batch stochastic gradient descent with the ADAM optimizer [33] and a learning rate of $0.1$. We used $K = 1000$, $\lambda_1 = $ , and $\lambda_2 = 0.01$. We performed a grid search and found these values to be the ones that produce sparse codes $\hat x$ and $\hat y^{(1)}$ that best approximate the ground-truth codes from the simulations. We trained for 150 epochs with a batch size of 40 examples. Similar to ISTA (Figure 2), ML-FISTA uses parameters $M_1$ and $M_2$ for its ReLUs, both of which we set equal to 10. Figure 6 shows the result of training the ML-FISTA auto-encoder. The SNR of the initial dictionary is 6 dB. The figure shows that training the architecture decreases the error from its initial value of $\approx 0.5$ to $\approx 0.2$ and $\approx 0.1$ for $A_1$ and $A_2$, respectively. These values are orders of magnitude higher than those achieved by Algorithms 2 and 3. These results are consistent with the experiments from [8].
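To make the connection between Equation 37 and unfolded soft-thresholding networks concrete, the sketch below takes the simplified special case $\lambda_2 = 0$, in which the problem reduces to a standard lasso on the product dictionary $A_2 A_1$, solvable by plain ISTA. This is a didactic simplification, not the ML-FISTA encoder of Figure 5 (which additionally thresholds the intermediate code $A_1 x$ at each iteration and adds momentum); dimensions and values are hypothetical.

```python
import numpy as np

def soft(u, tau):
    """Soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

rng = np.random.default_rng(0)
d2, d1, r1 = 30, 40, 60
A2 = rng.normal(size=(d2, d1)) / np.sqrt(d2)
A1 = rng.normal(size=(d1, r1)) / np.sqrt(d1)
D = A2 @ A1                                 # effective (product) dictionary

x_true = np.zeros(r1)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]      # a 3-sparse ground-truth code
y = D @ x_true

lam = 0.01
step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1/L, L = Lipschitz const. of the gradient

def objective(x):
    return 0.5 * np.sum((y - D @ x) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(r1)
objs = [objective(x)]
for _ in range(200):                        # each iteration = one unfolded "layer"
    x = soft(x + step * D.T @ (y - D @ x), step * lam)
    objs.append(objective(x))

# With step 1/L, ISTA decreases the objective monotonically
assert all(b <= a + 1e-12 for a, b in zip(objs, objs[1:]))
```

Reading each iteration as one layer of a recurrent network, with $D^T$ as the weights and `soft` as the two-sided ReLU, is exactly the unfolding interpretation used for the encoder in Figure 5.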
To our knowledge, there is no analysis in the literature of auto-encoder architectures for learning the dictionaries of a deep sparse-coding model, nor are there experiments assessing the ability of various architectures to learn the dictionaries of such models. We hypothesize that the errors plateau as a function of epoch because the values of $\lambda_1$ and $\lambda_2$ are constant. These two parameters play the role of the $\epsilon_t$ sequences in Algorithms 2 and 3 and, therefore, ought to decrease both as a function of ML-FISTA unfoldings [34] and as a function of epoch [7]. Unfortunately, a theoretical characterization of how to do this is absent from the literature at this time.

Fig. 6. Semi-log plot of the error between the true dictionaries and the ones recovered by the ML-FISTA auto-encoder as a function of epoch.

VII. DISCUSSION

We have provided insights into the complexity of learning deep ReLU neural networks by building a link between deep recurrent auto-encoders [10] and classical dictionary-learning theory [7]. We used this link to develop a deep version of the classical sparse-coding model from dictionary learning: starting with a sparse code, a cascade of linear transformations is applied in succession to generate an observation. Each transformation in the cascade is constrained to have sparse columns, except for the last. We developed two sequential alternating-minimization algorithms to learn this deep generative model from observations, and proved that, under assumptions stated in detail, the algorithms are able to learn the underlying transformations of the cascade that generated the observations.
The computational complexity of the algorithms is a function of the dimension of the inputs of each of the transformations in the cascade (number of hidden units), the sparsity level of these inputs (number of active neurons), and the sparsity of the columns of the sparse matrices (sparsity of the weight matrix in sparse layers). In particular, the complexity of the forward-factorization is $O\left(\max(r_1^2, r_1 s_{Y^{(0)}})\right)$, i.e., a function of the number of hidden units in the deepest layer and the number of active neurons in that layer. In addition, the forward-factorization algorithm requires $r_\ell = O\left(\max(r_{\ell+1}^2, r_{\ell+1} s^{(\ell)})\right)$, $\ell = 1, \dots, L-1$, i.e., it relates the number of hidden units at a given layer to that at the preceding layer, giving a practical prescription for designing deep ReLU architectures. The complexity of the backward-factorization algorithm is $O\left(\max_\ell \max(r_1^2, r_\ell s_{Y^{(\ell-1)}})\right)$: it requires more data because it puts less stringent constraints on the architecture. The proofs rely on a certain family of sparse matrices satisfying the RIP. We studied the properties of random versions of this family of matrices using results from non-asymptotic random matrix theory.

We used a coupon-collection [13] argument to conjecture a lower bound on sample complexity. This argument posits that we need $O(r_1 s \log r_1)$ examples for learning all of the matrices in the deep sparse coding model, where $r_1$ is the number of hidden units at the deepest layer, and $s$ the number of active neurons. This lower bound may explain why deep neural networks require more data to train. In the deep sparse coding model, each layer gives a sparse representation of the input. The deeper the layer, the sparser, and hence the more specific, the representation. Consider a scenario in which we wish to use the sparse representations at different depths for a classification task.
The deeper the representation, the sparser it is, and hence the easier the classification task, as it would require fewer features. We require $O(r_\ell s_{Y^{(\ell-1)}} \log r_\ell)$ examples to learn all of the weights up to layer $\ell$ and, hence, the sparse representations at layer $\ell$. The deeper we go, the sparser the representation and, therefore, the more data we need. Whether a deep network outperforms a shallower one may depend on whether it can result in a sparser, more specific, representation that is useful for a classification task.

We used simulations to validate the algorithms we propose for learning the dictionaries in the deep sparse coding model. The simulations show that the proposed algorithms can successfully learn a sequence of dictionaries and that they are robust to initialization at a wide range of SNRs. We also showed that the algorithms are superior to an auto-encoder architecture motivated by the multi-layer FISTA algorithm.

ACKNOWLEDGMENTS

I would like to thank Ms. Bahareh Tolooshams for her assistance with the neural-network experiments. I would also like to thank the anonymous referees for their feedback.

VIII. APPENDIX

A. Derivation of the soft-thresholding operator

The solution $\hat{x}$ must satisfy

$$y = \hat{x} + \lambda \partial|\hat{x}|. \qquad (38)$$

Case $y > \lambda$: The idea is to show that, if $y > \lambda$, then necessarily $\hat{x} > 0$. Suppose, for a contradiction, that $\hat{x} < 0$; then Equation 38 yields $y = \hat{x} - \lambda < 0$, a contradiction. Similarly, if $\hat{x} = 0$, then $y = \lambda g$ for a sub-gradient $g \in [-1, 1]$, a contradiction since $y > \lambda$. Therefore, if $y > \lambda$, Equation 38 yields $\hat{x} = y - \lambda$.

Case $y < -\lambda$: This case proceeds similarly to the one above.

Case $|y| \le \lambda$: Suppose, for a contradiction, that $\hat{x} > 0$; then Equation 38 yields $y > \lambda$. Similarly, if $\hat{x} < 0$, we obtain $y < -\lambda$. Therefore, if $|y| \le \lambda$, $\hat{x} = 0$.

Together, the three cases above yield the soft-thresholding function of Equation 2 as the solution to Equation 1.
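The three cases can be checked numerically: the minimizer of $\frac{1}{2}(y-x)^2 + \lambda|x|$, found by brute force over a fine grid, should match the closed-form soft-thresholding rule. A small sketch, with function names of our choosing:

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer of 0.5*(y - x)**2 + lam*|x| (Eq. 38)."""
    return np.sign(y) * max(abs(y) - lam, 0.0)

def brute_force(y, lam):
    """Minimize the same objective over a fine grid, for comparison."""
    xs = np.linspace(-10.0, 10.0, 200001)
    obj = 0.5 * (y - xs) ** 2 + lam * np.abs(xs)
    return xs[np.argmin(obj)]

# exercise all three cases: y > lam, y < -lam, |y| <= lam
for y in [2.0, -2.0, 0.3, -0.3, 0.5001]:
    assert abs(soft_threshold(y, 0.5) - brute_force(y, 0.5)) < 1e-3
```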
B. Proof of Theorem 1

We proceed by induction on $\bar{\ell}$.

Proof 3: Base case: $\bar{\ell} = L$. The theorem is true for this case by assumption A1.

Induction: Suppose the theorem is true for $\bar{\ell}$; we will show that it holds for $\bar{\ell} - 1$. Let $y^{(\bar{\ell}-2)}$ be a $2s_{Y^{(\bar{\ell}-2)}}$-sparse vector. Then

$$\left\| A^{(\bar{\ell}-1 \to L)} y^{(\bar{\ell}-2)} \right\|_2^2 = \left\| A^{(\bar{\ell} \to L)} A^{(\bar{\ell}-1)} y^{(\bar{\ell}-2)} \right\|_2^2. \qquad (39)$$

$A^{(\bar{\ell}-1)} y^{(\bar{\ell}-2)}$ is a $2s_{Y^{(\bar{\ell}-1)}}$-sparse vector, allowing us to apply our inductive hypothesis:

$$\prod_{\ell=\bar{\ell}}^{L} \left(1 - \delta_{2s_{Y^{(\ell-1)}}}\right) \left\| A^{(\bar{\ell}-1)} y^{(\bar{\ell}-2)} \right\|_2^2 \le \left\| A^{(\bar{\ell} \to L)} A^{(\bar{\ell}-1)} y^{(\bar{\ell}-2)} \right\|_2^2 \le \prod_{\ell=\bar{\ell}}^{L} \left(1 + \delta_{2s_{Y^{(\ell-1)}}}\right) \left\| A^{(\bar{\ell}-1)} y^{(\bar{\ell}-2)} \right\|_2^2. \qquad (40)$$

The result follows by assumption A1, since $A^{(\bar{\ell}-1)}$ satisfies the RIP of order $2s_{Y^{(\bar{\ell}-2)}}$.

A direct consequence of the theorem is that, $\forall \bar{\ell} = 1, \dots, L$, the RIP constant of $A^{(\bar{\ell} \to L)}$ must be smaller than or equal to $\max\left( 1 - \prod_{\ell=\bar{\ell}}^{L} (1 - \delta_{2s_{Y^{(\ell-1)}}}),\ \prod_{\ell=\bar{\ell}}^{L} (1 + \delta_{2s_{Y^{(\ell-1)}}}) - 1 \right)$.

C. Proof of Theorem 2

Proof 4: We proceed by induction on $\bar{\ell}$.

Base case: $\bar{\ell} = 1$. In this case, $Y = A^{(1 \to L)} X$. Theorem 3.1 from [7] guarantees that $\hat{A}^{(1 \to L)}$, the limit as $T \to \infty$ of $A^{(1 \to L)}(T)$, converges to $A^{(1 \to L)}$ with probability at least $1 - 2\delta^{(1 \to L)}$. Therefore, $P[E_{1 \to L}] \ge 1 - 2\delta^{(1 \to L)}$, proving the base case.

Induction: Suppose the theorem is true for $\bar{\ell}$; we will show that it is true for $\bar{\ell} + 1$. Conditioned on the event $\cap_{\ell=1}^{\bar{\ell}} E_{\ell \to L}$,

$$\hat{A}^{(\bar{\ell} \to L)} = A^{(\bar{\ell} \to L)} = A^{(\bar{\ell}+1 \to L)} A^{(\bar{\ell})}. \qquad (41)$$

The nonzero entries of $A^{(\bar{\ell})}$ do not have variance 1, which is one of the assumptions of Theorem 3.1 from [7].
This is why we scale $\hat{A}^{(\bar{\ell} \to L)}$ in line 5 of Algorithm 2 by $\sqrt{s_{\bar{\ell}}}$:

$$\sqrt{s_{\bar{\ell}}}\, \hat{A}^{(\bar{\ell} \to L)} = \sqrt{s_{\bar{\ell}}}\, A^{(\bar{\ell} \to L)} \qquad (42)$$
$$= \sqrt{s_{\bar{\ell}}}\, A^{(\bar{\ell}+1 \to L)} A^{(\bar{\ell})} \qquad (43)$$
$$= A^{(\bar{\ell}+1 \to L)} \left( \sqrt{s_{\bar{\ell}}}\, A^{(\bar{\ell})} \right). \qquad (44)$$

The nonzero entries of $\sqrt{s_{\bar{\ell}}}\, A^{(\bar{\ell})}$ have variance 1. Therefore, applying Theorem 3.1 from [7], the limit $\hat{A}^{(\bar{\ell}+1 \to L)}$ as $T \to \infty$ of $A^{(\bar{\ell}+1 \to L)}(T)$ converges to $A^{(\bar{\ell}+1 \to L)}$ with probability at least $1 - 2\delta^{(\bar{\ell}+1 \to L)}$. Hence,

$$P\left[\cap_{\ell=1}^{\bar{\ell}+1} E_{\ell \to L}\right] = P\left[E_{\bar{\ell}+1 \to L} \mid \cap_{\ell=1}^{\bar{\ell}} E_{\ell \to L}\right] P\left[\cap_{\ell=1}^{\bar{\ell}} E_{\ell \to L}\right] = \left(1 - 2\delta^{(\bar{\ell}+1 \to L)}\right) \prod_{\ell=1}^{\bar{\ell}} \left(1 - 2\delta^{(\ell \to L)}\right) = \prod_{\ell=1}^{\bar{\ell}+1} \left(1 - 2\delta^{(\ell \to L)}\right).$$

This completes the proof.

D. Sketch of proof of Theorem 3

We will prove the result by induction on $\bar{\ell}$. Compared to Theorem 2, the proof of Theorem 3 is less immediate. Indeed, the proof of Theorem 2 follows from application of Theorem 3.1 from [7] and induction, assuming the product matrices satisfy RIP (Section V). The proof of Theorem 3 requires us to fill in some steps in the proofs of the main results from [7] that lead to Theorem 3.1. We sketch how to fill in these steps below.

Before proceeding with the proof, let us discuss in detail the case when $L = 2$ in Equation 10 and $\bar{\ell} = 1$. We focus on exact recovery of $A^{(2)}$ and $A^{(1)}$ and defer computation of the probability in Equation 25 to the proof that follows.

Intuition behind the proof: Algorithm 3 begins by solving for $(\hat{A}^{(2)}, \hat{Y}^{(1)})$. If we can show that the algorithm succeeds for this pair, in particular that $\hat{A}^{(2)} = A^{(2)}$, then it follows that $\hat{A}^{(1)} = A^{(1)}$ in the following iteration of the algorithm. This is because, if the first iteration succeeds, then $Y^{(1)} = A^{(1)} X$, which is the very model of Equation 7 treated in detail in [7].
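The rescaling step of Equations 42–44 can be sanity-checked numerically: if the nonzero entries of $A^{(\bar{\ell})}$ are drawn with variance $1/s_{\bar{\ell}}$ (an assumption we make here purely for illustration), multiplying by $\sqrt{s_{\bar{\ell}}}$ restores the unit variance that Theorem 3.1 of [7] requires.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 4       # column sparsity of the sparse dictionary (illustrative)
n = 200000  # number of nonzero entries sampled

# nonzero entries with variance 1/s, as assumed for this illustration
nonzeros = rng.normal(0.0, 1.0 / np.sqrt(s), size=n)
rescaled = np.sqrt(s) * nonzeros

print(nonzeros.var(), rescaled.var())  # empirically ~1/s and ~1
```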
Focusing on $Y = A^{(2)} Y^{(1)}$, the key point we would like to explain is that, in Equation 7, the properties of $X$ that allow the authors of [7] to prove their main result also apply to $\sigma^{(0 \to 1)} Y = \sqrt{s^{(1)}} A^{(2)} Y^{(1)} = A^{(2)} (\sqrt{s^{(1)}} Y^{(1)})$. This is not directly obvious, because $\sqrt{s^{(1)}} Y^{(1)}$ is the product of a sparse matrix and the matrix $X$ of codes. We begin with a few remarks on the properties of $\sqrt{s^{(1)}} Y^{(1)}$.

Deriving a version of Lemma 3.2 from [7]: Lemma 3.2 relies on Lemmas A.1 and A.2, which give bounds for the matrix of indicator values for the nonzero entries of $X$. For $\sqrt{s^{(1)}} Y^{(1)} = \sqrt{s^{(1)}} A^{(1)} X$, we can replace, in the proof of Lemma 3.2, the matrix of indicators of its nonzero entries with the product of the matrices of indicators of the nonzero entries of $A^{(1)}$ and $X$, respectively. Using assumption B2c, this yields a bound that now depends on $n$, $r_2$, and the sparsity level $s \cdot s^{(1)}$.

Bounded singular values: Lemmas A.1 and A.2 from [7] give bounds on the largest and smallest singular values of $X^T$. Together with assumptions B2a and B2b, this yields bounds on the largest and smallest singular values of $\sqrt{s^{(1)}} Y^{(1)T}$:

$$\sigma_{\min}\left(\sqrt{s^{(1)}}\, Y^{(1)T}\right) \ge \tilde{\mu}^{(1)} \sqrt{\frac{n s^{(1)} s}{4 r_2}} \qquad (45)$$

$$\left\| \sqrt{s^{(1)}}\, Y^{(1)T} \right\|_2 \le \mu^{(1)}\, 2\sqrt{\frac{n \left(s^{(1)} s\right)^2}{r_2}}, \qquad (46)$$

where both $\tilde{\mu}^{(1)}$ and $\mu^{(1)}$ are $O(1)$ [15]. A version of Lemmas A.3 and A.4 for $\sqrt{s^{(1)}} Y^{(1)}$ directly follows.

Uniformly bounded variance: Since the columns of $X$ are i.i.d., so are those of $Y^{(1)}$. Moreover, since the entries of both $A^{(1)}$ and $X$ are bounded by assumption, so are those of $Y^{(1)}$. Let us compute the variance of $\sqrt{s^{(1)}} Y^{(1)}_{ij}$. First note that

$$\sqrt{s^{(1)}}\, Y^{(1)}_{ij} = \sqrt{s^{(1)}}\, e_i^T y_j^{(1)} = \sqrt{s^{(1)}}\, e_i^T A^{(1)} x_j, \qquad (47)$$

where $e_i$ is the $i$th element of the standard basis of $\mathbb{R}^{r_2}$.
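The variance computation that follows can be checked by simulation: for a code vector $x_j$ whose entries are, independently, nonzero with probability $s/r_1$ and drawn zero-mean with unit variance when nonzero, $E[(a^T x_j)^2] = \|a\|_2^2\, s/r_1$. A Monte-Carlo sketch, with dimensions and distributions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
r1, s, n_trials = 50, 5, 200000

a = rng.standard_normal(r1)  # plays the role of a fixed row of A^(1)
# codes with Bernoulli(s/r1) supports and N(0,1) nonzero entries
support = rng.random((n_trials, r1)) < s / r1
x = support * rng.standard_normal((n_trials, r1))

empirical = np.mean((x @ a) ** 2)      # Monte-Carlo estimate of E[(a^T x)^2]
theoretical = np.dot(a, a) * s / r1    # ||a||^2 * s / r1
print(empirical, theoretical)
```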
Since the nonzero entries of $x_j$ have mean zero by assumption B3, the entries of $x_j$ are uncorrelated. This is because the expected value of the product of two such entries is the expected value of the product of dependent Bernoulli random variables and independent mean-zero random variables, which is zero. Therefore, the variance of $\sqrt{s^{(1)}} Y^{(1)}_{ij}$ is

$$s^{(1)} E\left[\left(e_i^T A^{(1)} x_j\right)^2\right] = \left\| A^{(1)T} e_i \right\|_2^2 \frac{s^{(1)} s}{r_1}, \qquad (48)$$

where we have used the fact that, since the nonzero entries of $x_j$ have mean zero, they are uncorrelated, and we have substituted the variance $\frac{s}{r_1}$ of the entries of $x_j$. Using assumptions B2a and B2b gives uniform bounds from above and below, respectively, on the variance of $\sqrt{s^{(1)}} Y^{(1)}_{ij}$:

$$\tilde{\mu}_{(1)}^2 \frac{s^{(1)} s}{r_2} \le E\left[\left(\sqrt{s^{(1)}}\, Y^{(1)}_{ij}\right)^2\right] \le \mu_{(1)}^2 \frac{s^{(1)} s}{r_2}. \qquad (49)$$

A direct consequence of this are versions of Lemma A.5 and Lemma A.6 that apply to $\sqrt{s^{(1)}} Y^{(1)}$.

The above discussion gives all the necessary ingredients for a version of Lemma 3.3, the centerpiece of [7], applied to $\sqrt{s^{(1)}} Y^{(1)}$. We can now apply Theorem 3.1 from [7] to $\sqrt{s^{(1)}} Y = A^{(2)} (\sqrt{s^{(1)}} Y^{(1)})$, guaranteeing recovery of $A^{(2)}$. The interested reader can verify all of the above; a detailed technical exposition of these points would lead to a tedious and unnecessary digression, without adding much intuition. Using induction, it can be shown that the above remarks apply to $\sigma^{(0 \to \ell-1)} Y^{(\ell)} = A^{(\ell)} (\sigma^{(0 \to \ell-1)} Y^{(\ell-1)})$ for all $\ell = 1, \dots, L-1$.

Proof 5: We proceed by induction on $\bar{\ell}$.

Base case: $\bar{\ell} = L$. In this case, $\sigma^{(0 \to L-1)} Y = A^{(L)} (\sigma^{(0 \to L-1)} Y^{(L-1)})$. Following the remarks above, $\sigma^{(0 \to L-1)} Y^{(L-1)}$ obeys the properties of $X$ from Theorem 3.1 in [7]. This theorem guarantees that $\hat{A}^{(L)}$, the limit as $T \to \infty$ of $A^{(L)}(T)$, converges to $A^{(L)}$ with probability at least $1 - 2\delta_L$. Therefore, $P[E_L] \ge 1 - 2\delta_L$, proving the base case.
Induction: Suppose the theorem is true for $\bar{\ell}$; we will show that it is true for $\bar{\ell} - 1$. Conditioned on the event $\cap_{\ell=\bar{\ell}}^{L} E_\ell$, $\hat{Y}^{(\bar{\ell}-1)} = Y^{(\bar{\ell}-1)}$ and $\sigma^{(0 \to \bar{\ell}-2)} Y^{(\bar{\ell}-1)} = A^{(\bar{\ell}-1)} (\sigma^{(0 \to \bar{\ell}-2)} Y^{(\bar{\ell}-2)})$. Therefore, applying Theorem 3.1 from [7], the limit $\hat{A}^{(\bar{\ell}-1)}$ as $T \to \infty$ of $A^{(\bar{\ell}-1)}(T)$ converges to $A^{(\bar{\ell}-1)}$ with probability at least $1 - 2\delta_{\bar{\ell}-1}$. Therefore,

$$P\left[\cap_{\ell=\bar{\ell}-1}^{L} E_\ell\right] = P\left[E_{\bar{\ell}-1} \mid \cap_{\ell=\bar{\ell}}^{L} E_\ell\right] P\left[\cap_{\ell=\bar{\ell}}^{L} E_\ell\right] = \left(1 - 2\delta_{\bar{\ell}-1}\right) \prod_{\ell=\bar{\ell}}^{L} \left(1 - 2\delta_\ell\right) = \prod_{\ell=\bar{\ell}-1}^{L} \left(1 - 2\delta_\ell\right).$$

This completes the proof.

E. Proof of Theorem 6

Proof 6: We show this by bounding $\left\| \sqrt{\tfrac{s_A}{d}}\, a_j \right\|_{\psi_2}$:

$$\left\| \sqrt{\tfrac{s_A}{d}}\, a_j \right\|_{\psi_2} = \sup_{z \in S^{d-1}} \left\| \left\langle \sqrt{\tfrac{s_A}{d}}\, a_j, z \right\rangle \right\|_{\psi_2} \qquad (50)$$
$$= \sup_{z \in S^{d-1}} \sup_{p \ge 1} \frac{E\left[ \left| \sum_{i=1}^d z_i U_{ij} V_{ij} \right|^p \right]^{1/p}}{\sqrt{p}} \qquad (51)$$
$$= \sup_{z \in S^{d-1}} \sup_{p \ge 1} \frac{\left( E_{(U_{ij})_{i=1}^d}\!\left[ E\!\left[ \left| \sum_{i=1}^d z_i U_{ij} V_{ij} \right|^p \,\middle|\, (U_{ij})_{i=1}^d \right] \right] \right)^{1/p}}{\sqrt{p}} \qquad (52)$$
$$\le \sup_{z \in S^{d-1}} \sup_{p \ge 1} \frac{E_{(U_{ij})_{i=1}^d}\!\left[ E\!\left[ \left| \sum_{i=1}^d z_i U_{ij} V_{ij} \right|^p \,\middle|\, (U_{ij})_{i=1}^d \right]^{1/p} \right]}{\sqrt{p}} \qquad (53)$$
$$\le \sup_{z \in S^{d-1}} E_{(U_{ij})_{i=1}^d}\!\left[ \sup_{p \ge 1} \frac{E\!\left[ \left| \sum_{i=1}^d z_i U_{ij} V_{ij} \right|^p \,\middle|\, (U_{ij})_{i=1}^d \right]^{1/p}}{\sqrt{p}} \right]. \qquad (54)$$

Conditioned on $(U_{ij})_{i=1}^d$, $\sup_{p \ge 1} \frac{E[|\sum_{i=1}^d z_i U_{ij} V_{ij}|^p]^{1/p}}{\sqrt{p}}$ is the sub-Gaussian norm of a sum of $s_A$ independent sub-Gaussian random variables. Therefore, according to Lemma 5.9 in [15],

$$\left( \sup_{p \ge 1} \frac{E\!\left[ \left| \sum_{i=1}^d z_i U_{ij} V_{ij} \right|^p \,\middle|\, (U_{ij})_{i=1}^d \right]^{1/p}}{\sqrt{p}} \right)^2 \qquad (55)$$
$$= \left\| \sum_{i \in \{1, \dots, s_A\}} z_i V_{ij} \right\|_{\psi_2}^2 \qquad (56)$$
$$\le C_1 \| V \|_{\psi_2}^2 \sum_{i \in \{1, \dots, s_A\}} z_i^2 \qquad (57)$$
$$\le C_1 \| V \|_{\psi_2}^2 \| z \|_2^2. \qquad (58)$$

Putting Equation 58 back into Equation 54 yields

$$\| a_j \|_{\psi_2} \le \sqrt{\tfrac{d}{s_A}} \sup_{z \in S^{d-1}} E_{(U_{ij})_{i=1}^d}\!\left[ \sqrt{C_1} \| V \|_{\psi_2} \| z \|_2 \right] \qquad (59)$$
$$= \sqrt{\tfrac{d}{s_A}} \sup_{z \in S^{d-1}} \sqrt{C_1} \| V \|_{\psi_2} \| z \|_2 \qquad (60)$$
$$\le \sqrt{\tfrac{d}{s_A}} \sqrt{C_1} \| V \|_{\psi_2}. \qquad (61)$$

This completes the proof.

REFERENCES

[1] A. B. Patel, T. Nguyen, and R. G. Baraniuk, "A probabilistic theory of deep learning," arXiv preprint, 2015.
[2] V. Papyan, Y. Romano, and M. Elad, "Convolutional neural networks analyzed via convolutional sparse coding," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 2887–2938, 2017.
[3] V. Papyan, J. Sulam, and M. Elad, "Working locally thinking globally: Theoretical guarantees for convolutional sparse coding," IEEE Transactions on Signal Processing, vol. 65, no. 21, pp. 5687–5701, 2017.
[4] J. Sulam, V. Papyan, Y. Romano, and M. Elad, "Multilayer convolutional sparse modeling: Pursuit and dictionary learning," IEEE Transactions on Signal Processing, vol. 66, no. 15, pp. 4090–4104, 2018.
[5] J. C. Ye, Y. Han, and E. Cha, "Deep convolutional framelets: A general deep learning framework for inverse problems," SIAM Journal on Imaging Sciences, vol. 11, no. 2, pp. 991–1048, 2018.
[6] N. Tishby and N. Zaslavsky, "Deep learning and the information bottleneck principle," in Information Theory Workshop (ITW), 2015 IEEE. IEEE, 2015, pp. 1–5.
[7] A. Agarwal, A. Anandkumar, P. Jain, and P. Netrapalli, "Learning sparsely used overcomplete dictionaries via alternating minimization," SIAM Journal on Optimization, vol. 26, no. 4, pp. 2775–2799, 2016.
[8] T. V. Nguyen, R. K. Wong, and C. Hegde, "On the dynamics of gradient descent for autoencoders," in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 2858–2867.
[9] K. Gregor and Y.
LeCun, "Learning fast approximations of sparse coding," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 399–406.
[10] J. T. Rolfe and Y. LeCun, "Discriminative recurrent sparse auto-encoders," arXiv preprint, 2013.
[11] B. Tolooshams, S. Dey, and D. Ba, "Scalable convolutional dictionary learning with constrained recurrent sparse auto-encoders," in 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2018, pp. 1–6.
[12] T. Chang, B. Tolooshams, and D. Ba, "Randnet: Deep learning with compressed measurements of images," in 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2019, pp. 1–6.
[13] E. Anceaume, Y. Busnel, and B. Sericola, "New results on a generalized coupon collector problem using Markov chains," Journal of Applied Probability, vol. 52, no. 2, pp. 405–418, 2015.
[14] E. J. Candes et al., "The restricted isometry property and its implications for compressed sensing," Comptes rendus mathematique, vol. 346, no. 9-10, pp. 589–592, 2008.
[15] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," arXiv preprint, 2010.
[16] H. Sreter and R. Giryes, "Learned convolutional sparse coding," arXiv preprint arXiv:1711.00328, 2017.
[17] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[18] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[19] B. Tolooshams, S. Dey, and D. Ba, "Deep residual auto-encoders for expectation maximization-inspired dictionary learning," pp. 1–13, 2019.
[20] S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective. Academic Press, 2015.
[21] A.
Aberdam, J. Sulam, and M. Elad, "Multi-layer sparse coding: The holistic way," SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 46–77, 2019.
[22] S. Nam, M. E. Davies, M. Elad, and R. Gribonval, "The cosparse analysis model and algorithms," Applied and Computational Harmonic Analysis, vol. 34, no. 1, pp. 30–56, 2013.
[23] L. Le Magoarou and R. Gribonval, "Chasing butterflies: In search of efficient dictionaries," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 3287–3291.
[24] J. T. Parker, P. Schniter, and V. Cevher, "Bilinear generalized approximate message passing, Part I: Derivation," IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5839–5853, 2014.
[25] ——, "Bilinear generalized approximate message passing, Part II: Applications," IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5854–5867, 2014.
[26] D. A. Spielman, H. Wang, and J. Wright, "Exact recovery of sparsely-used dictionaries," in Conference on Learning Theory, 2012, pp. 37–1.
[27] A. Jung, Y. C. Eldar, and N. Görtz, "On the minimax risk of dictionary learning," IEEE Transactions on Information Theory, vol. 62, no. 3, pp. 1501–1515, 2016.
[28] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
[29] "The MOSEK optimization software." [Online]. Available: http://www.mosek.com/
[30] R. Garg and R. Khandekar, "Gradient descent with sparsification: An iterative algorithm for sparse recovery with restricted isometry property," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 337–344.
[31] Dask Development Team, Dask: Library for Dynamic Task Scheduling, 2016. [Online]. Available: http://dask.pydata.org
[32] J. Sulam, A. Aberdam, A. Beck, and M.
Elad, "On multi-layer basis pursuit, efficient algorithms and convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. the 3rd International Conference on Learning Representations (ICLR), 2014, pp. 1–15.
[34] X. Chen, J. Liu, Z. Wang, and W. Yin, "Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds," in Advances in Neural Information Processing Systems, 2018, pp. 9061–9071.
