On the Local Correctness of L^1 Minimization for Dictionary Learning
Authors: Quan Geng, Huan Wang, John Wright
∗ Department of Electrical Engineering, University of Illinois at Urbana-Champaign
† Department of Computer Science, Yale University
Visual Computing Group, Microsoft Research Asia

October 30, 2018

[Footnote 1: This work was partially performed while Q. Geng and H. Wang were interns in the Visual Computing Group at Microsoft Research Asia. The authors would like to thank Dan Spielman of Yale University and Yi Ma of MSRA for helpful discussions.]

Abstract

The idea that many important classes of signals can be well-represented by linear combinations of a small set of atoms selected from a given dictionary has had dramatic impact on the theory and practice of signal processing. For practical problems in which an appropriate sparsifying dictionary is not known ahead of time, a very popular and successful heuristic is to search for a dictionary that minimizes an appropriate sparsity surrogate over a given set of sample data. While this idea is appealing, the behavior of these algorithms is largely a mystery: although there is a body of empirical evidence suggesting they do learn very effective representations, there is little theory to guarantee when they will behave correctly, or when the learned dictionary can be expected to generalize. In this paper, we take a step towards such a theory. We show that under mild hypotheses, the dictionary learning problem is locally well-posed: the desired solution is indeed a local minimum of the $\ell^1$ norm. Namely, if $A \in \mathbb{R}^{m \times n}$ is an incoherent (and possibly overcomplete) dictionary, and the coefficients $X \in \mathbb{R}^{n \times p}$ follow a random sparse model, then with high probability $(A, X)$ is a local minimum of the $\ell^1$ norm over the manifold of factorizations $(A', X')$ satisfying $A'X' = Y$, provided the number of samples $p = \Omega(n^3 k)$. For overcomplete $A$, this is the first result showing that the dictionary learning problem is locally solvable. Our analysis draws on tools developed for the problem of completing a low-rank matrix from a small subset of its entries, which allow us to overcome a number of technical obstacles; in particular, the absence of the restricted isometry property.

1 Introduction

To a great extent, progress in signal processing over the past four decades has been driven by the quest for ever more effective signal representations. The development of increasingly powerful, relevant representations for natural images, from Fourier and DCT bases [ANR74] to Wavelets [MG84], Curvelets [CDDY06] and beyond, has significantly enriched our understanding of the structure of images, and has also spurred the development of influential practical coding standards [Wal91]. Because of this, hand design of signal representations has been a dominant paradigm in signal processing and applied mathematics. Indeed, it is difficult to overstate the intellectual and practical impact of this quest.

However, there are voices of dissent. One competing train of thought, dating at least back to the advent of the Karhunen-Loève transform in the 1970's, suggests that rather than meticulously designing an appropriate representation for each class of signals we encounter, it may be possible to simply learn an appropriate representation from large sets of sample data.
This idea has several appeals. Given the recent proliferation of new and exotic types of data (images, videos, web and bioinformatic data, etc.), it may not be possible to invest the intellectual effort required to develop optimal representations for each new class of signal we encounter. At the same time, data are becoming increasingly high-dimensional, a fact which stretches the limitations of our human intuition, potentially limiting our ability to develop effective data representations. It may be possible for an automatic procedure to discover useful structure in the data that is not readily apparent to us.

Spurred by this promise, researchers have invested a great amount of effort in developing algorithms that can automatically derive good representations for sample data. In particular, much recent effort has been focused on sparse linear representations. A signal $y \in \mathbb{R}^m$ is said to have a sparse representation in terms of a given dictionary of basis signals $A = [A_1, \ldots, A_n] \in \mathbb{R}^{m \times n}$ if $y \approx Ax$, where $x \in \mathbb{R}^n$ is a coefficient vector with only a few nonzero entries ($k = \|x\|_0 \ll n$). This notion of sparsity has emerged in the past 10 years as a dominant idea in signal processing [BDE09]. This is due both to the ubiquity of sparsity (or near-sparsity) in practical problems, as well as to a line of fundamental theoretical results [DE03, Fuc04, CT05, ZY06, MY09] asserting that if $y$ is known to be sparse in a known basis $A$ satisfying certain technical conditions, the sparse coefficients $x_0$ can be very accurately estimated (sometimes perfectly so!) by solving an $\ell^1$ minimization problem:

\[
\text{minimize } \|x\|_1 \quad \text{subject to } y = Ax. \tag{1.1}
\]

These theoretical results allow us to deploy tools from sparse signal representation with great confidence: if the signal $y$ has a sparse representation, then efficient algorithms are guaranteed to recover it.

When facing a new class of signals, however, it is not clear how to begin: what basis $A$ might allow typical signals $y$ to be sparsely represented? A popular heuristic is to search for a basis $A$ that allows a given set of examples $Y = [y_1, \ldots, y_p] \in \mathbb{R}^{m \times p}$ to be represented as compactly as possible. That is, we attempt to solve the following model problem, often referred to as "dictionary learning":

Given samples $Y = [y_1, \ldots, y_p] \in \mathbb{R}^{m \times p}$, all of which can be sparsely represented in terms of some unknown dictionary $A$ ($Y = AX$, for some $X$ with sparse columns), recover $A$.

A number of algorithms have been proposed for this problem [OF96, EAHH99, KDMR+03, AEB06, MBPS10] (see the survey [RBE10] for a more thorough review). Exploiting sparsity in learned dictionaries has led to practical success in a number of important problems in signal acquisition and processing [EA06, BE08, MBP+08, RS08, YWHM10]. On the other hand, relatively little theory is available to explain when and why dictionary learning algorithms succeed. There is also little in the way of guidelines to tell practitioners when the learned dictionary is expected to generalize beyond the given sample set $Y$. This stands in contrast to the situation with hand-designed dictionaries, which often come with proofs of optimality for important classes of signals.

In this paper, we take a step towards closing this gap. We study a model optimization approach to dictionary learning:

\[
\text{minimize } \|X\|_1 \quad \text{subject to } Y = AX, \quad \|A_i\|_2 = 1 \ \forall i. \tag{1.2}
\]
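As a concrete aside, problem (1.1) is the workhorse per-sample subproblem underlying the formulation (1.2). The sketch below solves it via the standard linear-programming reformulation (splitting $x$ into positive and negative parts). The solver choice and problem sizes are our own illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve min ||x||_1 s.t. Ax = y via the standard LP reformulation
    x = u - v, u, v >= 0, minimizing sum(u) + sum(v)."""
    m, n = A.shape
    c = np.ones(2 * n)                      # objective: sum of u and v
    A_eq = np.hstack([A, -A])               # A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]

# Illustrative sizes (assumptions, not from the paper): m = 20, n = 40, k = 3
rng = np.random.default_rng(0)
m, n, k = 20, 40, 3
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)              # unit-norm columns, as in (1.2)
x0 = np.zeros(n)
x0[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x0
x_hat = basis_pursuit(A, y)
print("recovery error:", np.linalg.norm(x_hat - x0))
```

For sufficiently sparse $x_0$ and incoherent $A$, the recovered vector coincides with $x_0$, in line with the classical results cited above.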
Here, $\|\cdot\|_1$ denotes the sum of magnitudes, $\|X\|_1 = \sum_{ij} |X_{ij}|$. This optimization problem was first studied by Gribonval and Schnass [GS10], as a natural abstraction of popular dictionary learning algorithms (we will discuss the results of [GS10] in more detail in Section 2.1). Notice that while the objective function in (1.2) is convex, the constraint is not. Hence, in general it may seem that all we can hope for is a local optimum. This is a common feature of dictionary learning algorithms. Indeed, it is a classical observation in source separation that if we take a permutation matrix $\Pi \in \mathbb{R}^{n \times n}$ and a diagonal matrix of signs $\Sigma$, then whenever $(A, X)$ solves the above problem, so does $(A\Pi\Sigma, \Sigma\Pi^* X)$. This "sign-permutation ambiguity" implies that corresponding to every local minimum of (1.2), there is a class of $2^n n!$ equivalent solutions. Moreover, a priori there is nothing to prevent the existence of exponentially large classes of local minima. This might lead one to a dispiriting conclusion: "the problem (1.2) is impossible to solve in general; moreover, nothing rigorous can be said about its solution."

[Figure 1: Phase transitions in dictionary recovery? We test whether locally minimizing the $\ell^1$ norm correctly recovers the dictionary $A \in \mathbb{R}^{m \times n}$ and sparse coefficients $X \in \mathbb{R}^{n \times p}$, for varying sparsity levels $k$ and problem sizes $n$. Left: $m = n$. Middle: $m = 0.8\,n$. Right: $m = 0.6\,n$. Here, $p = 5 n \log(n)$. Trials are judged successful if the relative error $\|\hat{A} - A\|_F / \|A\|_F$ in the recovered $\hat{A}$ is smaller than $10^{-5}$. We average over 10 trials; white corresponds to success in all trials, black to failure in all trials. The problems are solved using an algorithm outlined in [GWW11]. Each panel plots sparsity $k$ (2 to 10) against basis size $n$ (10 to 60).]

Part of the goal of this paper is to dispel such pessimism. Figure 1 shows why there might be reason for hope. In it, we solve various synthetic instances of the problem (1.2), with varying problem size and sparsity level. The figure plots fractions of correct recoveries, for various aspect ratios $m/n$ of $A \in \mathbb{R}^{m \times n}$. We observe a very intriguing phenomenon:

Empirically, optimization algorithms for dictionary learning succeed when the problem is well-structured ($X$ is sufficiently sparse), and fail otherwise. Moreover, in simulated examples, the transition between these two modes of operation is fairly sharp.

This suggests that, similar to the results for $\ell^1$-minimization discussed above, there are important classes of dictionary learning problems that can be solved exactly by efficient (polynomial time) algorithms.

Fully understanding this phenomenon is a long-term goal. Although local optimization approaches to dictionary learning have repeatedly demonstrated good empirical behavior, the aforementioned difficulties of non-convexity and sign-permutation ambiguity raise significant technical obstacles to developing a theory of their correctness. Nevertheless, a step in this direction was taken by Gribonval and Schnass [GS10], who showed that if $A$ is square ($m = n$), then for certain random coefficient models, the desired solution is indeed a local minimum of the $\ell^1$-norm with high probability.
In this paper, we show that this is true for a wider range of matrices, including overcomplete dictionaries $A$ with more columns than rows. We prove:

If the matrix $A$ is appropriately incoherent and the coefficients $X$ are drawn from a random sparsity model, then after seeing polynomially many samples (say, $\Omega(n^3)$), with high probability the desired solution is indeed locally recoverable.

For non-square matrices, this is the first result suggesting that correct recovery is possible by $\ell^1$-minimization, even locally. Establishing it seems to demand a different set of technical tools and ideas from [GS10]. We will see that understanding the local properties of (1.2) essentially requires us to study a certain equality-constrained $\ell^1$ norm minimization problem, which arises by linearizing the nonlinear constraint in (1.2) at the desired solution $(A_\star, X_\star)$. While $\ell^1$ norm minimization has been widely studied, and its correctness for recovering sparse representations in known bases (i.e., problem (1.1)) is increasingly well-understood, the particular $\ell^1$ minimization problem encountered in dictionary learning raises new challenges. In particular, we will see that the linear constraints in this problem do not satisfy the Restricted Isometry Property (RIP) [CT05], a fact which significantly complicates their analysis. We are instead inspired by an analogy to the problem of completing a low-rank matrix from an observation consisting of a small subset of its entries [CR08], another problem in which the RIP (and its analogues) fails. In particular, our analysis is inspired by the golfing scheme of David Gross [Gro09], which has proved useful for a variety of problems where the RIP is absent [CLMW09, CP10]. We also make heavy use of the convenient and powerful operator Chernoff bounds of Joel Tropp [Tro10], whose work builds on an approach introduced by Ahlswede and Winter [AW02].

1.1 Organization

This paper is organized as follows. In Section 2, we describe in greater detail the model studied here, formally state our main result, and discuss its implications. In particular, in Section 2.1 we discuss its relationship to existing results. The remainder of the paper comprises a proof of this result. Section 3 develops optimality conditions, phrased in terms of the existence of a certain dual certificate. In Section 4, we construct this dual certificate. The success of the construction relies on a certain balancedness property of the linearized subproblem at the optimum; we formally state and prove this property in Section 5.

1.2 Notation

For matrices, $X^*$ will denote the transpose of $X$. $\|X\|$ will denote the $\ell^2$ operator norm. $\|X\|_F = \sqrt{\operatorname{tr}[X^* X]}$ will denote the Frobenius norm. By slight abuse of notation, $\|X\|_1$ and $\|X\|_\infty$ will denote the $\ell^1$ and $\ell^\infty$ norms of the matrix, viewed as a large vector:

\[
\|X\|_1 = \sum_{ij} |X_{ij}|, \qquad \|X\|_\infty = \max_{ij} |X_{ij}|. \tag{1.3}
\]

For vectors $x$, the notation $\|x\|$ will mean the $\ell^2$ norm $\sqrt{x^* x}$; $\|x\|_1$ and $\|x\|_\infty$ will denote the usual $\ell^1$ and $\ell^\infty$ norms, respectively. $[n]$ denotes the first $n$ positive integers, $\{1, \ldots, n\}$. The symbols $e_1, \ldots, e_d$ will denote the $d$ standard basis vectors for $\mathbb{R}^d$; their dimension will be clear from context. Throughout, the symbols $C_1, C_2, \ldots, c_1, c_2, \ldots$ refer to numerical constants. When used in different sections, they need not refer to the same constant.
For a linear subspace $V \subset \mathbb{R}^d$, we will let $P_V \in \mathbb{R}^{d \times d}$ denote the projection matrix onto $V$. For a linear subspace $V$ contained in a more general linear space (say, $V \subset \mathbb{R}^{d \times d'}$), we will let $\mathcal{P}_V$ denote the projection operator onto this space. We will slightly abuse notation and define, for $I \subseteq [d]$, $P_I$ to be the projection matrix onto the subspace of vectors supported on $I$; similarly, for $\Omega \subseteq [d] \times [d']$, $\mathcal{P}_\Omega : \mathbb{R}^{d \times d'} \to \mathbb{R}^{d \times d'}$ will denote the projection operator onto $\Omega$, which retains the entries indexed by $\Omega$ and sets the rest to zero. As usual, $A \otimes B$ denotes the Kronecker product between matrices $A$ and $B$. For $B \in \mathbb{R}^{a \times b}$, $\operatorname{vec}[B] \in \mathbb{R}^{ab}$ is defined by stacking $B$ as a vector, columnwise.

2 Main Result

As described in the previous section, this paper is dedicated to better understanding the good behavior of $\ell^1$ minimization for dictionary learning. In particular, we would like to assert that under natural, easily-satisfied conditions, the desired solution can be recovered, at least locally. Of course, whether this is true will depend strongly on the properties of the dictionary $A$ to be recovered, as well as the sparse coefficients $X$ that generate our observation $Y = AX$. In this paper, we restrict our attention to dictionaries $A$ whose columns have unit $\ell^2$ norm. We will adopt the simple assumption that the columns of $A$ are well-spread in the observation space $\mathbb{R}^m$, i.e., the mutual coherence [DE03]

\[
\mu(A) = \max_{i \neq j} |\langle A_i, A_j \rangle| \tag{2.1}
\]

is small. Classical results [DE03, GN03, Fuc04] show that if $A$ is a (known) dictionary, then $\ell^1$ minimization recovers any sparse representation with up to roughly $1/2\mu(A)$ nonzeros:

\[
\|x_0\|_0 < \tfrac{1}{2}\left(1 + 1/\mu(A)\right) \implies x_0 = \arg\min \|x\|_1 \ \text{ subject to } \ Ax = Ax_0. \tag{2.2}
\]

This result, while pessimistic compared to typical-case behavior [CP09], is powerful because its assumptions on $A$ are reasonable; it does not seem particularly onerous to assume that $\mu(A)$ will be small for learned dictionaries.[2]

The next question is how to model the sparse coefficients $X$. In analogy to results in sparse representation, we would like to assert that dictionary learning algorithms function correctly when their assumptions are met, i.e., when the coefficients $X$ are sufficiently sparse. However, it is also clear that by itself sparsity of $X$ is not sufficient for $(A, X)$ to be a local minimum. As a very simple example, imagine that there is some $i$ for which all of the $X_{ij}$ are zero. In this paper, we assume that the sparsity pattern of $X$ is random, and that the values of the nonzero entries are Gaussian. More precisely, we assume that each of the columns $x_1, \ldots, x_p$ of $X \in \mathbb{R}^{n \times p}$ is generated iid by first choosing $k$ out of its $n$ entries uniformly at random to be nonzero, and letting the values of these nonzero entries be independent Gaussians with zero mean and common standard deviation $\sigma$. The choice of a Gaussian model is one of mathematical convenience; the results in this paper are easily generalized to wider classes of symmetric distributions. However, the assumptions of zero mean and common variance are more essential to our analysis.

We can state the above model more formally as follows. We assume that the observations $Y = [y_1, \ldots, y_p] \in \mathbb{R}^{m \times p}$ are generated iid, $y_j = A x_j$, where $x_j \in \mathbb{R}^n$ satisfies a Gaussian-random-sparsity model:

\[
\Omega_j \sim \operatorname{uni}\binom{[n]}{k} \tag{2.3}
\]
and
\[
x_j = P_{\Omega_j} v_j, \tag{2.4}
\]
where
\[
v_{ij} \sim_{\text{iid}} \mathcal{N}(0, \sigma^2), \qquad \sigma = \sqrt{\frac{n}{kp}}. \tag{2.5}
\]
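For readers who want to experiment with this coefficient model, the following sketch draws $(A, X, Y)$ according to (2.3)-(2.5) and reports the mutual coherence (2.1). Using a normalized Gaussian matrix for $A$ is our own choice for illustration; the paper only assumes unit-norm, incoherent columns.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, p = 30, 50, 4, 2000          # illustrative sizes, not from the paper

# A: unit-norm columns (random dictionary chosen purely for illustration)
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)

# Mutual coherence mu(A) = max_{i != j} |<A_i, A_j>|, as in (2.1)
mu = np.max(np.abs(A.T @ A) - np.eye(n))

# Coefficients X following (2.3)-(2.5): k uniform nonzeros per column,
# iid N(0, sigma^2) values with sigma = sqrt(n / (k p))
sigma = np.sqrt(n / (k * p))
X = np.zeros((n, p))
for j in range(p):
    omega_j = rng.choice(n, size=k, replace=False)
    X[omega_j, j] = sigma * rng.standard_normal(k)

Y = A @ X
print("mu(A) =", mu, " spectral norm of X ~", np.linalg.norm(X, 2))
```

The printed spectral norm should be close to one, reflecting the normalization discussed after (2.5).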
That is, $X = \mathcal{P}_\Omega[V]$, where $\Omega = \{(i, j) \mid j \in [p],\ i \in \Omega_j\}$ is the overall support set. The advantage to writing $X$ in this manner is that it makes the independence of $\Omega$ and $V$ clear. The scaling on $v_{ij}$ plays no essential role in our proof; the normalization in (2.5) is simply notationally convenient because it implies that the spectral norm $\|X\|$ is approximately one when $p$ is large.

In dictionary learning, we do not observe $A$ or $X$, but rather their product $Y = AX \in \mathbb{R}^{m \times p}$. Corresponding to this observation $Y$, there exists a manifold of possible factorizations

\[
\mathcal{M} = \{(A, X) \mid AX = Y,\ \|A_i\|_2 = 1\ \forall i\} \subset \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times p}. \tag{2.6}
\]

In this notation, our model approach (1.2) can be viewed as a nonsmooth optimization over this smooth submanifold:

\[
\text{minimize } f(x) \quad \text{subject to } x \in \mathcal{M}. \tag{2.7}
\]

Our main result states that if $x = (A, X)$ satisfies the above assumptions, then provided the number of samples is large enough, with high probability $x$ will be a local minimum of $f$. More precisely:

Theorem 2.1. There exist numerical constants $C_1, C_2, C_3 > 0$ such that the following occurs. If $x = (A, X)$ satisfies the probability model (2.3)-(2.5) with

\[
k \le \min\{C_1 / \mu(A),\ C_2 n\}, \tag{2.8}
\]

then $x$ is a local minimum of the $\ell^1$ norm over $\mathcal{M}$, with probability at least

\[
1 - C_3 \|A\|^2 n^{3/2} k^{1/2} p^{-1/2} (\log p). \tag{2.9}
\]

This result implies that from polynomially many samples (say, $p = \omega(n^3 k)$), the dictionary learning problem becomes locally well-posed, i.e., the desired solution becomes a local minimum of the $\ell^1$-norm. One can see that the sparsity demanded by Theorem 2.1 mimics that of (2.2). Indeed, this result implies that under essentially the same conditions as the classical bound for sparse recovery (2.2), one can (locally) recover all of the sparse coefficients $X$, as well as the sparsifying basis $A$.

[Footnote 2: Conversely, there is less a priori reason to believe that dictionaries learned from sample data will satisfy more powerful assumptions such as the RIP. The absence of RIP in $A$ should not, however, be confused with the absence of RIP in the local analysis of dictionary learning, which as we will see arises not from the properties of $A$ per se, but rather from the structure of the tangent space to the constraint manifold.]

2.1 Comparison to existing results

As mentioned in the introduction, the theory of dictionary learning is only beginning to develop. The most direct point of comparison for our result is the very nice paper of Gribonval and Schnass [GS10] (henceforth "G-S"). That work proposed to study the optimization (1.2), and developed conditions for a given solution $x = (A, X)$ to be a local minimum. These conditions essentially demand that $x$ be optimal over the tangent space to the constraint manifold at $x$. While we do not directly use the optimality conditions of G-S, the duality condition that we base our approach on is essentially equivalent. However, the subsequent analysis uses a completely different set of tools and approaches. Aside from developing optimality conditions, the major contribution of [GS10] is a probabilistic analysis of the case when $A \in \mathbb{R}^{n \times n}$ is square and the coefficients $X$ are iid Bernoulli-Gaussian, i.e., each $X_{ij}$ is nonzero with probability $\rho$, and the nonzero entries are conditionally Gaussian.
Using arguments from geometry and concentration of measure, G-S show that in this situation $(A, X)$ is a local optimum with high probability provided $p = \Omega(n \log n / \rho)$. Our Theorem 2.1 is more general, since it encompasses cases where $A$ is nonsquare (i.e., an overcomplete dictionary). However, the number of samples stipulated by our bound is larger. Indeed, if we take $k = O(1)$, and set $\rho = k/n$ for purposes of comparison, then for square matrices, G-S's result guarantees correct recovery from $n^2 \log n$ samples. Our result requires at least $n^3$ samples, but applies to general matrices. It is possible that the gap between the two orders of growth might be further closed with a more refined analysis of the construction proposed in this paper.

2.2 Discussion

While we find these results quite encouraging, there is still much to do. In fact, there remains a wealth of fascinating open problems just involving the linearized subproblem. One natural question is whether the assumption of hard sparsity in $X$ can be relaxed to a Bernoulli-Gaussian model, with similar probability of each coefficient being nonzero, i.e., $\rho \approx k/n$. In this case, care will need to be taken because a small number of columns of $X$ may be so dense as to not be optimal. However, we see no essential obstacle to extending the approach used here to deal with this case. Another, more difficult question is what will happen if the number of nonzero entries dramatically exceeds $C_1/\mu(A)$. In this case, again, many of the individual columns of $X$ may be suboptimal, but it is still likely that the basis $A$ is a local minimum. We believe that the golfing scheme of Section 4 will again provide a relevant tool. However, more work will need to be done to ensure that the balancedness condition in Theorem 5.1 still holds. Even more interesting from an application perspective would be to show that noise does not significantly affect the local optimality of the desired solution $x_\star$. The framework of Negahban and collaborators may be relevant here [NRWY09].

3 Local Properties and the Linearized Subproblem

As we saw in the previous section, our main result concerns the local optimality of the desired solution $x = (A, X)$ over the smooth submanifold $\mathcal{M} \subset \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times p}$. A key role in this result will be played by the tangent space $T_x\mathcal{M}$ to $\mathcal{M}$ at $x$, which can be identified[3] with the space of all perturbations $(\Delta_A, \Delta_X) \in \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times p}$ satisfying

\[
\Delta_A X + A \Delta_X = 0, \qquad \langle A_i, \Delta_A e_i \rangle = 0 \ \forall i. \tag{3.1}
\]

The first equation comes from differentiating the bilinear constraint $Y = AX$, while the second comes from differentiating the constraint $\|A_i\|_2 = 1$. Intuitively, we might hope to study the local properties of $f$ by studying how it behaves on the tangent space at $x$. Replacing $\mathcal{M}$ with its linearization about $x$ yields the following optimization problem:

\[
\text{minimize } f(x + \delta) \quad \text{subject to } \delta \in T_x\mathcal{M}. \tag{3.2}
\]

Using the above characterization of $T_x\mathcal{M}$, this can be written a bit more concretely as

\[
\text{minimize } \|X + \Delta_X\|_1 \quad \text{subject to } \Delta_A X + A \Delta_X = 0, \quad \langle A_i, \Delta_A e_i \rangle = 0 \ \forall i. \tag{3.3}
\]

This linearized subproblem is convex. In particular, it is easy to see that under an appropriate change of variables, it is equivalent to an equality constrained $\ell^1$ minimization problem,

\[
\text{minimize } \|z\|_1 \quad \text{subject to } Bz = Bz_0. \tag{3.4}
\]

[Footnote 3: In this paper, our space of optimization $\mathcal{M}$ is most naturally specified as a submanifold of $\mathbb{R}^D$. We will commit sins of notation such as identifying the tangent space $T_x\mathcal{M}$ with a particular vector subspace of $\mathbb{R}^D$, and occasionally writing $x + \delta \in \mathbb{R}^D$, where $x \in \mathcal{M}$ and $\delta \in T_x\mathcal{M}$. Given the relatively small role played by the intrinsic Riemannian structure of $\mathcal{M}$ in the paper, we believe these simplifications are justified.]
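To make the change of variables behind (3.4) tangible, the constraints (3.1) can be written as one linear system acting on the stacked vector $[\operatorname{vec}(\Delta_A); \operatorname{vec}(\Delta_X)]$. The sketch below builds that system explicitly with Kronecker products for small sizes; the explicit stacking is our own illustration of the reduction, not a construction used in the paper.

```python
import numpy as np

def tangent_constraints(A, X):
    """Return C such that C @ delta = 0 encodes (3.1), where
    delta = [vec(Delta_A); vec(Delta_X)] with column-major vec.

    vec(Delta_A X) = (X^T kron I_m) vec(Delta_A)
    vec(A Delta_X) = (I_p kron A)   vec(Delta_X)
    <A_i, Delta_A e_i> = 0 for each i.
    """
    m, n = A.shape
    _, p = X.shape
    # Delta_A X + A Delta_X = 0   (m*p equations)
    top = np.hstack([np.kron(X.T, np.eye(m)), np.kron(np.eye(p), A)])
    # <A_i, Delta_A e_i> = 0      (n equations), selecting column i of Delta_A
    norm_rows = np.zeros((n, m * n + n * p))
    for i in range(n):
        norm_rows[i, i * m:(i + 1) * m] = A[:, i]
    return np.vstack([top, norm_rows])

# Small illustrative instance
rng = np.random.default_rng(2)
m, n, p = 8, 10, 15
A = rng.standard_normal((m, n)); A /= np.linalg.norm(A, axis=0)
X = rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.2)
C = tangent_constraints(A, X)
print("constraint matrix:", C.shape, " nullspace dim:",
      C.shape[1] - np.linalg.matrix_rank(C))
```

The nullspace of this matrix is (an explicit coordinate version of) the tangent space $T_x\mathcal{M}$ over which (3.2) is posed.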
This should give us reason for optimism: as alluded to in the introduction, a great deal of effort has gone into developing technical tools for understanding the solutions to $\ell^1$-minimization problems. The following lemma tells us that in order to determine if $x$ is a local minimum, it is enough to ask whether $\delta = 0$ is the unique optimal solution to the linearized subproblem (3.2):

Lemma 3.1. Suppose that $x \in \mathcal{M}$ is such that $\delta = 0$ is the unique optimal solution to (3.2). Then $x$ is a local minimum of the function $f(\cdot)$ over $\mathcal{M}$. Conversely, if $x$ is a local minimum, then $\delta = 0$ is an optimal solution to (3.2).

Proof. Please see Appendix D.

We will prove our main result, Theorem 2.1, by showing that under the stated conditions the zero perturbation $(\Delta_A, \Delta_X) = (0, 0)$ is indeed the unique optimal solution to (3.3). To do so, we need to study an equality constrained $\ell^1$-minimization problem of the same form as (3.4). In the absence of specific assumptions on the distribution of $B$ (such as Gaussianity [DT09]), the dominant tool for doing this is the Restricted Isometry Property (RIP), which holds with order $k$ and constant $0 \le \delta < 1$ if

\[
(1 - \delta)\|z\|^2 \le \|Bz\|^2 \le (1 + \delta)\|z\|^2 \quad \forall z \ \text{such that} \ \|z\|_0 \le k. \tag{3.5}
\]

When the RIP holds (with appropriate $k$, $\delta$), the $\ell^1$-minimization (3.4) recovers any sufficiently sparse $z_0$, and noise-aware versions perform stably [Can08]. Thus, if we could show that the equality constraints in (3.3) satisfy an appropriate RIP variant, we would be done.

Unfortunately, this is not the case: the RIP fails for our problem of interest. We sketch why this is true. At a high level, the RIP states that the operator $B$ respects the geometry of all sparse vectors; in particular, there are no sparse vectors near the nullspace of $B$. In our case, $B$ is specified by the equality constraints in (3.3). Take any permutation matrix $\Pi \in \mathbb{R}^{n \times n}$ with no fixed point, and set

\[
\Delta_A = -A\Pi, \qquad \Delta_X = \Pi X. \tag{3.6}
\]

Then, it is easy to see that $\Delta_A X + A \Delta_X = 0$. Moreover, for each $i$,

\[
\langle A_i, \Delta_A e_i \rangle = -\langle A_i, A_{\pi(i)} \rangle \approx 0, \tag{3.7}
\]

which follows because $\pi(i) \neq i$ and $A$ has incoherent columns. Thus, we have constructed a perturbation $(\Delta_A, \Delta_X)$ that lies very near the nullspace of $B$, and such that $\Delta_X$ has exactly the same sparsity as the desired solution $X$. In fact, not only does the RIP not hold, but structured variants (for example, restricting to matrices $\Delta_X$ with sparse columns, rather than general sparse matrices) also fail. We make this intuitive argument more precise in Section A of the Appendix.
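The construction (3.6)-(3.7) is easy to check numerically. The sketch below (under the same illustrative random model as the earlier snippets) verifies that the perturbation satisfies the bilinear constraint exactly and violates the normalization constraints by at most $\mu(A)$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p, k = 30, 50, 500, 4
A = rng.standard_normal((m, n)); A /= np.linalg.norm(A, axis=0)
X = np.zeros((n, p))
for j in range(p):
    idx = rng.choice(n, size=k, replace=False)
    X[idx, j] = rng.standard_normal(k)

# Fixed-point-free permutation (a cyclic shift), as in (3.6)
Pi = np.roll(np.eye(n), 1, axis=1)
Delta_A = -A @ Pi
Delta_X = Pi @ X

mu = np.max(np.abs(A.T @ A) - np.eye(n))
bilinear = np.linalg.norm(Delta_A @ X + A @ Delta_X)          # exactly zero
norm_violation = np.max(np.abs(np.sum(A * Delta_A, axis=0)))  # |<A_i, Delta_A e_i>|
print("||Delta_A X + A Delta_X|| =", bilinear)
print("max |<A_i, Delta_A e_i>| =", norm_violation, "<= mu(A) =", mu)
```

Since $\Delta_X = \Pi X$ is exactly as sparse as $X$, this direction is precisely the kind of near-nullspace sparse perturbation that the RIP would have to exclude.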
This leaves us in a situation with less in common with compressed sensing, and much more in common with the difficult problem of matrix completion [CR08]. In that problem, we are shown a small set of entries $Q_{i_1,j_1}, \ldots, Q_{i_p,j_p}$ of an unknown low-rank matrix $Q$. The goal is to fill in the missing values. There, the natural analogue of the RIP also fails, since the sampling operator completely misses some low-rank matrices (for example, those consisting of a single nonzero entry that is not in the observed set). This fact significantly complicates analysis [CT09]. Motivated by applications in quantum information theory, recent papers by Gross and collaborators have introduced a number of technical tools that significantly ease the analysis of matrix completion [Gro09], allowing them to derive near-optimal recovery guarantees in a clear and simple manner. Moreover, the ideas of [Gro09] appear to be useful in a variety of settings beyond matrix completion [CLMW09, CP10]. In this paper, we use similar proof techniques to analyze the linearized subproblem (3.3). While the details necessarily differ quite a bit from Gross's work, our inspiration is very much the success of these tools in other non-RIP settings.

To describe this scheme in more detail, however, it is easiest to start at the very beginning. We wish to establish that $(0, 0)$ is optimal for (3.3). To do so, we recall the KKT conditions for this problem, which imply that $(0, 0)$ is optimal if and only if there exist two dual variables, a matrix $\Lambda \in \mathbb{R}^{m \times p}$ (corresponding to the constraint $\Delta_A X + A \Delta_X = 0$) and a diagonal matrix $\Gamma \in \mathbb{R}^{n \times n}$ (corresponding to the constraint $\langle A_i, \Delta_A e_i \rangle = 0$) satisfying

\[
A^* \Lambda \in \partial \|\cdot\|_1(X), \tag{3.8}
\]
\[
\Lambda X^* = A \Gamma. \tag{3.9}
\]

The interested reader can easily derive these conditions; we will provide a rigorous proof of a more useful variant below in Lemma 3.2. The first constraint simply asserts that each column $x_j$ of $X$ is the minimum $\ell^1$ norm solution to $Ax = y_j$. Indeed, writing $\Omega = \operatorname{support}(X)$ and $\Sigma = \operatorname{sign}(X)$, we recall that

\[
\partial \|\cdot\|_1(X) = \{\Sigma + W \mid \mathcal{P}_\Omega[W] = 0,\ \|W\|_\infty \le 1\}. \tag{3.10}
\]

Then, (3.8) holds if and only if there exist $w_1, \ldots, w_p \in \mathbb{R}^n$ such that

\[
A^* \lambda_j = \Sigma_j + w_j, \qquad P_{\Omega_j} w_j = 0, \qquad \|w_j\|_\infty \le 1. \tag{3.11}
\]

This constraint is quite familiar from $\ell^1$-minimization: duality, and in particular the construction of dual certificates $\lambda_j$, plays a crucial role in a number of works on the correctness of $\ell^1$ minimization [Fuc04, CT05, WM10]. On the other hand, the second constraint (3.9) is less familiar. It essentially asserts that locally we cannot improve our situation by changing the basis $A$. Notice that it demands that each column of $\Lambda X^*$ be proportional to the corresponding column of $A$; we find it convenient to introduce an operator $\Phi : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ that projects each column onto the orthogonal complement of the corresponding column of $A$:

\[
\Phi[M] = \left[(I - A_1 A_1^*) M_1 \mid \cdots \mid (I - A_n A_n^*) M_n\right], \tag{3.12}
\]

giving an equivalent constraint $\Phi[\Lambda X^*] = 0$. This constraint still places demands on all of the dual vectors $\lambda_j$ simultaneously, making it potentially more difficult to satisfy than (3.8).
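The column-wise projection $\Phi$ of (3.12) is simple to realize numerically. The helper below is our own illustration of it, together with checks of two properties used later in the proof of Lemma 3.2: $\Phi$ is self-adjoint and idempotent (it is the orthogonal projection onto the set of matrices whose $i$-th column is orthogonal to $A_i$).

```python
import numpy as np

def Phi(M, A):
    """Project each column of M onto the orthogonal complement of the
    corresponding (unit-norm) column of A, as in (3.12)."""
    coeffs = np.sum(A * M, axis=0)        # <A_i, M_i> for each column i
    return M - A * coeffs                 # M_i - A_i <A_i, M_i>

rng = np.random.default_rng(4)
m, n = 30, 50
A = rng.standard_normal((m, n)); A /= np.linalg.norm(A, axis=0)
M = rng.standard_normal((m, n))

P = Phi(M, A)
print(np.max(np.abs(np.sum(A * P, axis=0))))   # ~0: columns of Phi[M] orthogonal to A_i
print(np.linalg.norm(Phi(P, A) - P))           # ~0: Phi is idempotent
```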
In the following lemma, we trade off between the two constraints, showing that if we tighten our demands on (3.8), we can correspondingly loosen the demand on (3.9):

Lemma 3.2. Let $A$ be a matrix with no $k$-sparse vectors in its nullspace. Suppose that there exists $\alpha > 0$ such that for all pairs $(\Delta_A, \Delta_X)$ satisfying (3.1),

\[
\|\mathcal{P}_{\Omega^c}\Delta_X\|_F \ge \alpha\, \|\Delta_A\|_F. \tag{3.13}
\]

Then if there exists $\Lambda \in \mathbb{R}^{m \times p}$ such that

\[
\mathcal{P}_\Omega[A^*\Lambda] = \Sigma, \qquad \|\mathcal{P}_{\Omega^c}[A^*\Lambda]\|_\infty \le 1/2, \tag{3.14}
\]
and
\[
\|\Phi[\Lambda X^*]\|_F < \frac{\alpha}{2}, \tag{3.15}
\]

we conclude that $(\Delta_A, \Delta_X) = (0, 0)$ is the unique optimal solution to (3.3).

Proof. Consider any feasible $(\Delta_A, \Delta_X)$. Choose $H \in \partial\|\cdot\|_1(X)$ such that $\langle H, \mathcal{P}_{\Omega^c}\Delta_X\rangle = \|\mathcal{P}_{\Omega^c}\Delta_X\|_1$, and notice that $\mathcal{P}_\Omega H = \Sigma$. Then

\[
\|X + \Delta_X\|_1 \ge \|X\|_1 + \langle H, \Delta_X\rangle. \tag{3.16}
\]

Notice that since $(\Delta_A, \Delta_X)$ is feasible,

\[
\langle \Delta_A, \Lambda X^*\rangle + \langle \Delta_X, A^*\Lambda\rangle = \langle \Delta_A X, \Lambda\rangle + \langle A\Delta_X, \Lambda\rangle = \langle \Delta_A X + A\Delta_X, \Lambda\rangle = \langle 0, \Lambda\rangle = 0.
\]

Hence,

\[
\begin{aligned}
\|X + \Delta_X\|_1 &\ge \|X\|_1 + \langle H, \Delta_X\rangle - \langle A^*\Lambda, \Delta_X\rangle - \langle \Lambda X^*, \Delta_A\rangle && (3.17)\\
&= \|X\|_1 + \langle H - A^*\Lambda, \Delta_X\rangle - \langle \Lambda X^*, \Delta_A\rangle && (3.18)\\
&= \|X\|_1 + \langle \mathcal{P}_\Omega[H - A^*\Lambda], \mathcal{P}_\Omega\Delta_X\rangle + \langle \mathcal{P}_{\Omega^c}[H - A^*\Lambda], \mathcal{P}_{\Omega^c}\Delta_X\rangle - \langle \Phi[\Lambda X^*], \Delta_A\rangle && (3.19)\\
&= \|X\|_1 + \langle \mathcal{P}_{\Omega^c}[H - A^*\Lambda], \mathcal{P}_{\Omega^c}\Delta_X\rangle - \langle \Phi[\Lambda X^*], \Delta_A\rangle && (3.20)\\
&\ge \|X\|_1 + \|\mathcal{P}_{\Omega^c}\Delta_X\|_1/2 - \|\Delta_A\|_F\,\|\Phi[\Lambda X^*]\|_F && (3.21)\\
&\ge \|X\|_1 + \Big(\tfrac{1}{2} - \alpha^{-1}\|\Phi[\Lambda X^*]\|_F\Big)\,\|\mathcal{P}_{\Omega^c}\Delta_X\|_1. && (3.22)
\end{aligned}
\]

In (3.19), we have used the fact that since $\Delta_A$ is feasible, each column of $\Delta_A$ is orthogonal to the corresponding column of $A$, and so $\Phi[\Delta_A] = \Delta_A$. Furthermore, it is easily verified that $\Phi$ is self-adjoint, and so $\langle \Lambda X^*, \Phi[\Delta_A]\rangle = \langle \Phi[\Lambda X^*], \Delta_A\rangle$. In (3.20), we have used that since $H \in \partial\|\cdot\|_1$, $\mathcal{P}_\Omega H = \Sigma = \mathcal{P}_\Omega[A^*\Lambda]$. The right hand side of (3.22) is strictly greater than $\|X\|_1$ provided that (i) $\|\Phi[\Lambda X^*]\|_F < \alpha/2$ and (ii) $\mathcal{P}_{\Omega^c}\Delta_X \neq 0$. The assumption (3.13) and our assumption on the nullspace of $A$ imply (ii) for any nonzero feasible pair (if $\mathcal{P}_{\Omega^c}\Delta_X = 0$, then (3.13) forces $\Delta_A = 0$, and the columns of $\Delta_X$ would be $k$-sparse vectors in the nullspace of $A$, hence zero), establishing uniqueness.

The remainder of the argument will show that the hypotheses of this lemma indeed hold. In Section 4, we give a construction of a dual matrix $\Lambda$ that always satisfies (3.14), and satisfies (3.15) with high probability, provided $p$ is large enough. This is the content of Theorem 4.1. In Section 5, we show that with high probability the balancedness property (3.13) indeed holds with nonzero $\alpha$ (in particular, we can take $\alpha = C/\|A\|^2$). This is done in Theorem 5.1. Combining these two results with Lemma 3.2 completes the proof of Theorem 2.1. The proofs of these key lemmas make repeated use of bounds on singular values of submatrices of an incoherent matrix. For completeness, we assemble these required results in Appendix C.

4 Certification Process

In this section, we show how to construct the dual certificate demanded by Lemma 3.2. In particular, we prove the following result:

Theorem 4.1. There exist numerical constants $C_1, C_2, C_3 > 0$ such that the following occurs. Whenever

\[
k \le \min\left\{\frac{C_1}{\mu(A)},\ C_2 n\right\}, \tag{4.1}
\]

then for any $\alpha > 0$, there exists $\Lambda \in \mathbb{R}^{m \times p}$ simultaneously satisfying the following three properties:

\[
\mathcal{P}_\Omega[A^*\Lambda] = \Sigma, \tag{4.2}
\]
\[
\|\mathcal{P}_{\Omega^c}[A^*\Lambda]\|_\infty \le 1/2, \tag{4.3}
\]
\[
\|\Phi[\Lambda X^*]\|_F < \alpha/2, \tag{4.4}
\]

with probability at least

\[
1 - C_3\, \alpha^{-1}\, n^{3/2} k^{1/2} p^{-1/2} (\log p). \tag{4.5}
\]

4.1 First Steps

The following lemma goes most of the way to establishing Theorem 4.1:

Lemma 4.2. Fix any $p > 0$ and let $x_1, \ldots, x_p$ be independent and identically distributed random vectors with $x_j = P_{\Omega_j} v_j$, where the $\Omega_j \subset [n]$ are uniform random subsets of size $k$ and the $v_j$ have iid $\mathcal{N}(0, n/(kp))$ entries. Then there exists a positive integer $t_\star \in [\lfloor (p-1)/2 \rfloor, p]$ and a sequence of random vectors $\lambda_1, \ldots, \lambda_{t_\star}$ depending only on $x_1, \ldots, x_{t_\star}$ such that

\[
P_{\Omega_j} A^* \lambda_j = \operatorname{sign}(x_j), \qquad j = 1, \ldots, t_\star, \tag{4.6}
\]
\[
\|P_{\Omega_j^c} A^* \lambda_j\|_\infty \le 1/2, \qquad j = 1, \ldots, t_\star, \tag{4.7}
\]
\[
\mathbb{E}\Big[\big\|\Phi\big[\textstyle\sum_{j=1}^{t_\star} \lambda_j x_j^*\big]\big\|_F\Big] \le C\, n^{3/2} k^{1/2} p^{-1/2}, \tag{4.8}
\]

where $C$ is numerical.

Section 4.2 proves Lemma 4.2 by giving an explicit construction of the desired dual certificates.
Before describing this construction in greater detail, we first show that Theorem 4.1 follows as an easy consequence of Lemma 4.2, by dividing the sample set into subsets and then applying Lemma 4.2 to each subset.

Proof of Theorem 4.1. Choose $t_1$ according to Lemma 4.2, and let $\lambda_1, \ldots, \lambda_{t_1}$ be the corresponding (random) dual vectors indicated by Lemma 4.2. Then

\[
\mathbb{E}\Big[\big\|\Phi\big[\textstyle\sum_{j=1}^{t_1} \lambda_j x_j^*\big]\big\|_F\Big] \le C\, n^{3/2} k^{1/2} p^{-1/2}. \tag{4.9}
\]

Moreover, unless $p < 3$, $p - t_1 \le 3p/4$. Notice that the iid random vectors

\[
\left(\tfrac{p}{p - t_1}\right)^{1/2} x_{t_1+1},\ \ldots,\ \left(\tfrac{p}{p - t_1}\right)^{1/2} x_p \tag{4.10}
\]

again satisfy the hypotheses of Lemma 4.2. Hence, there exists $\delta \in [\lfloor (p - t_1 - 1)/2 \rfloor,\ p - t_1]$ and corresponding certificates $\lambda_{t_1+1}, \ldots, \lambda_{t_1+\delta}$, again satisfying (4.6)-(4.7), such that if we set $t_2 = t_1 + \delta$, we have

\[
\mathbb{E}\Big[\big\|\Phi\big[\textstyle\sum_{j=t_1+1}^{t_2} \lambda_j x_j^*\big]\big\|_F\Big] \le \left(\tfrac{p - t_1}{p}\right)^{1/2} C\, n^{3/2} k^{1/2} (p - t_1)^{-1/2} = C\, n^{3/2} k^{1/2} p^{-1/2}. \tag{4.11}
\]

This leaves at most $p - t_2 \le \max\big((3/4)^2 p,\ 2\big)$ vectors $x_{t_2+1}, \ldots, x_p$ to be certified. Repeating this construction $O(\log p)$ times yields a sequence of dual certificates $\lambda_1, \ldots, \lambda_p$ satisfying (4.6)-(4.7), with

\[
\mathbb{E}\Big[\big\|\Phi\big[\textstyle\sum_{j=1}^{p} \lambda_j x_j^*\big]\big\|_F\Big] \le C' (\log p)\, n^{3/2} k^{1/2} p^{-1/2}.
\]

The desired probability estimate follows from the Markov inequality.

4.2 The construction

In this section, we outline a process that constructs the random sequence of certificates $\lambda_1, \ldots, \lambda_{t_\star}$ described in Lemma 4.2. We will describe a construction of $p$ certificates $\lambda_1, \ldots, \lambda_p$, and then choose $t_\star \in [\lfloor (p-1)/2 \rfloor, p]$ according to our analysis of this construction. Recall that we have defined $\Omega_j = \operatorname{support}(x_j) \subset [n]$. Below, we will use $\Theta_j \in \mathbb{R}^{m \times m}$ to denote the orthoprojector onto the orthogonal complement of the range of $A_{\Omega_j}$:

\[
\Theta_j = I - A_{\Omega_j} (A_{\Omega_j}^* A_{\Omega_j})^{-1} A_{\Omega_j}^*. \tag{4.12}
\]

We will let $Q_j$ denote the residual at time $j$:

\[
Q_j \doteq \sum_{l=1}^{j} \Phi[\lambda_l x_l^*]. \tag{4.13}
\]

As above, let $\sigma_j = \operatorname{sign}(x_j(\Omega_j)) \in \{\pm 1\}^k$. Set

\[
\zeta_j = \begin{cases} \dfrac{1}{4}\, \dfrac{\Theta_j Q_{j-1} x_j}{\|\Theta_j Q_{j-1} x_j\|} & \text{if } \Theta_j Q_{j-1} x_j \neq 0, \\[2mm] 0 & \text{else}, \end{cases} \tag{4.14}
\]
\[
\lambda_j^{LS} = A_{\Omega_j} (A_{\Omega_j}^* A_{\Omega_j})^{-1} \sigma_j, \tag{4.15}
\]
\[
\lambda_j = \lambda_j^{LS} - \zeta_j. \tag{4.16}
\]

While it appears complicated, the rationale for the above procedure is actually quite simple. At each step $j$, we construct a certificate $\lambda_j \in \mathbb{R}^m$. We would like to make $Q_j = \Phi[\sum_{l=1}^{j} \lambda_l x_l^*]$ as small as possible, while still respecting the constraints $A_{\Omega_j}^* \lambda_j = \sigma_j$ and $\|A_{\Omega_j^c}^* \lambda_j\|_\infty \le 1/2$. The first term, $\lambda_j^{LS}$, serves to ensure that the certification constraints are met. Notice that since $A_{\Omega_j}^* \Theta_j = 0$,

\[
A_{\Omega_j}^* \lambda_j = A_{\Omega_j}^* (\lambda_j^{LS} - \zeta_j) = A_{\Omega_j}^* \lambda_j^{LS} = \sigma_j. \tag{4.17}
\]

Moreover, for each $i \notin \Omega_j$,

\[
|A_i^* \lambda_j^{LS}| = |A_i^* A_{\Omega_j} (A_{\Omega_j}^* A_{\Omega_j})^{-1} \sigma_j| \le \|A_i^* A_{\Omega_j}\|_2\, \|(A_{\Omega_j}^* A_{\Omega_j})^{-1}\|\, \|\sigma_j\|_2. \tag{4.18}
\]

Since $\sigma_j \in \{\pm 1\}^k$, $\|\sigma_j\|_2 = \sqrt{k}$. Under the assumption $k\mu(A) < 1/2$, a standard argument (repeated as (C.4) of Appendix C) shows that $\|(A_{\Omega_j}^* A_{\Omega_j})^{-1}\| \le 2$. Finally, since $A_i^* A_{\Omega_j}$ is a vector of length $k$ with entries bounded by $\mu(A)$, $\|A_i^* A_{\Omega_j}\|_2 \le \mu(A)\sqrt{k}$. Combining bounds, we have

\[
\|A_{\Omega_j^c}^* \lambda_j^{LS}\|_\infty = \max_{i \notin \Omega_j} |A_i^* \lambda_j^{LS}| \le 2 k \mu(A). \tag{4.19}
\]

Hence, further assuming $k\mu(A) < 1/8$, we obtain $\|A_{\Omega_j^c}^* \lambda_j^{LS}\|_\infty \le 1/4$ and

\[
\|A_{\Omega_j^c}^* \lambda_j\|_\infty \le \|A_{\Omega_j^c}^* \lambda_j^{LS}\|_\infty + \|A_{\Omega_j^c}^* \zeta_j\|_\infty \le \frac{1}{4} + \max_i \|A_i\|_2\, \|\zeta_j\|_2 \le \frac{1}{2}. \tag{4.20}
\]
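The recursion (4.12)-(4.16) is straightforward to simulate. The sketch below is a minimal implementation assuming the same synthetic model as the earlier snippets; the function names are ours, and the code is only meant to make the construction concrete, not to reproduce the paper's experiments.

```python
import numpy as np

def phi(M, A):
    # Column-wise projection of (3.12)
    return M - A * np.sum(A * M, axis=0)

def build_certificates(A, X):
    """Construct lambda_1, ..., lambda_p following (4.12)-(4.16) and return
    them together with the final residual Q_p = sum_l Phi[lambda_l x_l^*]."""
    m, _ = A.shape
    _, p = X.shape
    Q = np.zeros_like(A)
    lambdas = []
    for j in range(p):
        x = X[:, j]
        omega = np.flatnonzero(x)
        A_om = A[:, omega]
        G_inv = np.linalg.inv(A_om.T @ A_om)
        Theta = np.eye(m) - A_om @ G_inv @ A_om.T           # (4.12)
        lam_ls = A_om @ (G_inv @ np.sign(x[omega]))         # (4.15)
        u = Theta @ (Q @ x)
        nu = np.linalg.norm(u)
        zeta = 0.25 * u / nu if nu > 0 else np.zeros(m)     # (4.14)
        lam = lam_ls - zeta                                 # (4.16)
        Q = Q + phi(np.outer(lam, x), A)                    # residual update (4.13)
        lambdas.append(lam)
    return np.array(lambdas).T, Q
```

On a small instance drawn as in the earlier sketches, one can check numerically that each $\lambda_j$ satisfies $A_{\Omega_j}^* \lambda_j = \sigma_j$ and $\|A_{\Omega_j^c}^* \lambda_j\|_\infty \le 1/2$, and track how $\|Q_j\|_F$ evolves with $j$.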
The choice of $\zeta_j$ is designed to deflate the residual $Q$ as much as possible. [Footnote 4: Indeed, $\zeta_j$ is a scaled version of a solution to the optimization problem: minimize $\|Q_{j-1} + \zeta x_j^*\|_F$ subject to $A_{\Omega_j}^* \zeta = 0$.] As we will see in the proof of Theorem 4.1, this process does succeed in controlling the norm of the residual $Q_j$.

4.3 Analysis

The next question is how to analyze the order of growth of $\|Q_j\|_F$, as a function of the matrix $A$ and the sparsity $k$. To be more formal, let $\mathcal{F}_j$ be the $\sigma$-algebra generated by $\Omega_1, \ldots, \Omega_j$ and $v_1, \ldots, v_j$; $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_p$ is the natural filtration. We will occasionally use the notation $\mathbb{E}_{\Omega_j}[f(\Omega, V)]$ to denote the expectation over $\Omega_j$, with all other variables fixed. More precisely,

\[
\mathbb{E}_{\Omega_j}[f(\Omega, V)] \doteq \mathbb{E}\big[f(\Omega, V) \mid \sigma(\Omega_1, \ldots, \Omega_{j-1}, \Omega_{j+1}, \ldots, \Omega_p, V)\big]. \tag{4.21}
\]

We will use a similar notation $\mathbb{E}_{v_j}[f(\Omega, V)]$ for the expectation over $v_j$ with all other variables fixed. With these notations in mind, we embark on the proof of Lemma 4.2.

Proof of Lemma 4.2. We begin by using $Q_j = Q_{j-1} + \Phi[\lambda_j x_j^*]$ to write

\[
\mathbb{E}\big[\|Q_j\|_F^2 \mid \mathcal{F}_{j-1}\big] = \|Q_{j-1}\|_F^2 + 2\,\mathbb{E}\big[\langle Q_{j-1}, \Phi[\lambda_j x_j^*]\rangle \mid \mathcal{F}_{j-1}\big] + \mathbb{E}\big[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal{F}_{j-1}\big]. \tag{4.22}
\]

We will show that there exist $\varepsilon(p) > 0$ and $\tau(p)$ such that

\[
\mathbb{E}\big[\langle Q_{j-1}, \Phi[\lambda_j x_j^*]\rangle \mid \mathcal{F}_{j-1}\big] \le -\varepsilon(p) \times \|Q_{j-1}\|_F, \tag{4.23}
\]
\[
\mathbb{E}\big[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal{F}_{j-1}\big] \le \tau(p). \tag{4.24}
\]

Plugging into (4.22) and taking the expectation of both sides yields

\[
\mathbb{E}[\|Q_j\|_F^2] \le \mathbb{E}[\|Q_{j-1}\|_F^2] - 2\varepsilon(p)\,\mathbb{E}[\|Q_{j-1}\|_F] + \tau(p). \tag{4.25}
\]

Summing from $j = 1, \ldots, p$ and using that $Q_0 = 0$, we have

\[
\mathbb{E}[\|Q_p\|_F^2] \le p\,\tau(p) - 2\varepsilon(p) \sum_{j=1}^{p-1} \mathbb{E}[\|Q_j\|_F]. \tag{4.26}
\]

In paragraphs (i)-(iii) below, we show that the quantities $\varepsilon$ and $\tau$ satisfy the following bounds:

\[
\varepsilon(p) \ge C_1 \sqrt{k/(np)}, \qquad \tau(p) \le C_2\, nk/p. \tag{4.27}
\]

For now, taking these bounds as given, we observe that by (4.26)

\[
\mathbb{E}[\|Q_1\|_F] \le \big(\mathbb{E}[\|Q_1\|_F^2]\big)^{1/2} \le \sqrt{\tau(p)}, \tag{4.28}
\]

and hence the claim of Lemma 4.2 is verified in the case $p = 1$. On the other hand, if $p > 1$, then using the fact that the left hand side of (4.26) is nonnegative, we have

\[
\frac{1}{p-1} \sum_{j=1}^{p-1} \mathbb{E}[\|Q_j\|_F] \le \frac{p}{p-1} \cdot \frac{\tau(p)}{2\varepsilon(p)} \le \tau(p)/\varepsilon(p). \tag{4.29}
\]

We recognize the left hand side of this inequality as an average. By the Markov inequality, if we were to choose an index $t \in \{1, \ldots, p-1\}$ uniformly at random, then with probability at least $1/2$, $\mathbb{E}[\|Q_t\|_F] \le 2\tau(p)/\varepsilon(p)$. In particular, since $[\lfloor (p-1)/2 \rfloor, p]$ contains more than half the elements of $\{1, \ldots, p-1\}$, there exists at least one $t_\star \in [\lfloor (p-1)/2 \rfloor, p]$ such that

\[
\mathbb{E}[\|Q_{t_\star}\|_F] \le 2\tau(p)/\varepsilon(p). \tag{4.30}
\]

Plugging in the bounds from (4.27) establishes Lemma 4.2. All that remains to do is show that the bounds in (4.27) indeed hold. We establish the bound on $\varepsilon$ in paragraphs (i)-(ii) below, using the convenient split

\[
\mathbb{E}\big[\langle Q_{j-1}, \lambda_j x_j^*\rangle \mid \mathcal{F}_{j-1}\big] = \mathbb{E}\big[\langle Q_{j-1}, \lambda_j^{LS} x_j^*\rangle \mid \mathcal{F}_{j-1}\big] - \mathbb{E}\big[\langle Q_{j-1}, \zeta_j x_j^*\rangle \mid \mathcal{F}_{j-1}\big]. \tag{4.31}
\]

Finally, in paragraph (iii) below, we establish the bound on $\tau$ claimed in (4.27), completing the proof of Lemma 4.2.

(i) Upper bounding $\langle Q_{j-1}, \lambda_j^{LS} x_j^*\rangle$. For $\Omega_j = \{a_1 < a_2 < \cdots < a_k\}$, set $U_{\Omega_j} \doteq [e_{a_1} \mid e_{a_2} \mid \cdots \mid e_{a_k}] \in \mathbb{R}^{n \times k}$, so that $P_{\Omega_j} = U_{\Omega_j} U_{\Omega_j}^*$.
Notice that we can write

\[
\big\langle Q_{j-1}, \lambda_j^{LS} x_j^* \big\rangle = \big\langle Q_{j-1},\, A_{\Omega_j} (A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^* \operatorname{sgn}(v_j) v_j^* P_{\Omega_j} \big\rangle. \tag{4.32}
\]

Then, using that $\mathbb{E}[\operatorname{sgn}(v_j) v_j^*] = c_1 \sigma I$, we have

\[
\mathbb{E}\big[\big\langle Q_{j-1}, \lambda_j^{LS} x_j^*\big\rangle \mid \mathcal{F}_{j-1}\big] = \mathbb{E}_{\Omega_j} \mathbb{E}_{v_j}\big[\big\langle Q_{j-1},\, A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^* \operatorname{sgn}(v_j) v_j^* P_{\Omega_j}\big\rangle\big] = c_1 \sigma\, \mathbb{E}_{\Omega_j}\big[\big\langle Q_{j-1},\, A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^*\big\rangle\big]. \tag{4.33}
\]

Write $(A_{\Omega_j}^* A_{\Omega_j})^{-1} = I + \Delta(\Omega_j)$. Then, we have

\[
\big\langle Q_{j-1},\, A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^*\big\rangle = \big\langle Q_{j-1} P_{\Omega_j},\, A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^*\big\rangle = \big\langle Q_{j-1} P_{\Omega_j},\, A_{\Omega_j} U_{\Omega_j}^*\big\rangle + \big\langle Q_{j-1} P_{\Omega_j},\, A_{\Omega_j} \Delta(\Omega_j) U_{\Omega_j}^*\big\rangle.
\]

Since $Q_{j-1} = \Phi[\sum_{l=1}^{j-1} \lambda_l x_l^*] \in \operatorname{range}(\Phi)$, each of the columns of $Q_{j-1} \in \mathbb{R}^{m \times n}$ is orthogonal to the corresponding column of $A$. Since the first inner product in the above equation is simply the inner product of the restriction of $A$ to a subset of its columns with the restriction of $Q_{j-1}$ to the same subset of its columns, this term is zero. Applying the Cauchy-Schwarz inequality to the second term of the previous equation gives

\[
\big\langle Q_{j-1},\, A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^*\big\rangle \le \|Q_{j-1} P_{\Omega_j}\|_F\, \|A_{\Omega_j}\|\, \|\Delta(\Omega_j)\|_F. \tag{4.34}
\]

Standard calculations, given in (C.2) of Appendix C, show that $\|A_{\Omega_j}\| \le (1 + k\mu(A))^{1/2}$ is bounded by a constant, say, $c_2$. A similar calculation in (C.6) shows that $\|\Delta(\Omega_j)\|_F \le 2 k \mu(A)$. Plugging back into (4.33), we have

\[
\mathbb{E}\big[\big\langle Q_{j-1}, \lambda_j^{LS} x_j^*\big\rangle \mid \mathcal{F}_{j-1}\big] \le 2 c_1 c_2\, \sigma\, k \mu(A)\, \mathbb{E}_{\Omega_j}\big[\|Q_{j-1} P_{\Omega_j}\|_F\big]. \tag{4.35}
\]

For now, we will be content with this expression.

(ii) Lower bounding $\langle Q_{j-1}, \zeta_j x_j^*\rangle$. Continuing, we have that

\[
\begin{aligned}
\langle Q_{j-1}, \zeta_j x_j^* \rangle &= \langle Q_{j-1} x_j, \zeta_j \rangle && (4.36)\\
&= \frac{1}{4}\left\langle Q_{j-1} x_j,\ \frac{\Theta_j Q_{j-1} x_j}{\|\Theta_j Q_{j-1} x_j\|} \right\rangle && (4.37)\\
&= \frac{1}{4}\,\|\Theta_j Q_{j-1} x_j\| && (4.38)\\
&\ge \frac{1}{4}\,\|Q_{j-1} x_j\| - \frac{1}{4}\,\|P_{A_{\Omega_j}} Q_{j-1} x_j\|, && (4.39)
\end{aligned}
\]

where $P_{A_{\Omega_j}} = A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} A_{\Omega_j}^* \in \mathbb{R}^{m \times m}$. For the first term of (4.39), applying the Kahane-Khintchine inequality in Corollary B.3 gives

\[
\mathbb{E}\big[\|Q_{j-1} x_j\| \mid \mathcal{F}_{j-1}\big] = \mathbb{E}_{\Omega_j}\mathbb{E}_{v_j}\big[\|Q_{j-1} P_{\Omega_j} v_j\|\big] \tag{4.40}
\]
\[
\ge \frac{\sigma}{\sqrt{\pi}} \times \mathbb{E}_{\Omega_j}\big[\|Q_{j-1} P_{\Omega_j}\|_F\big]. \tag{4.41}
\]

For the second term of (4.39), writing $P_{A_{\Omega_j}} = A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1/2} \times (A_{\Omega_j}^* A_{\Omega_j})^{-1/2} A_{\Omega_j}^*$, where the first factor has spectral norm one, we have

\[
\|P_{A_{\Omega_j}} Q_{j-1} x_j\| = \big\|(A_{\Omega_j}^* A_{\Omega_j})^{-1/2} A_{\Omega_j}^* Q_{j-1} P_{\Omega_j} v_j\big\| \tag{4.42}
\]
\[
\le \big\|(A_{\Omega_j}^* A_{\Omega_j})^{-1/2}\big\|\, \big\|A_{\Omega_j}^* Q_{j-1} P_{\Omega_j} v_j\big\|. \tag{4.43}
\]

A calculation shows that under the assumption $k\mu(A) < 1/2$, $\|(A_{\Omega_j}^* A_{\Omega_j})^{-1/2}\| \le \sqrt{2}$, and so

\[
\|P_{A_{\Omega_j}} Q_{j-1} x_j\| \le \sqrt{2} \times \|A_{\Omega_j}^* Q_{j-1} P_{\Omega_j} v_j\| = \sqrt{2} \times \|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j} v_j\|. \tag{4.44}
\]

Applying Jensen's inequality to bound the expectation of the above expression, we have (via Corollary B.3)

\[
\begin{aligned}
\mathbb{E}\big[\|P_{A_{\Omega_j}} Q_{j-1} x_j\| \mid \mathcal{F}_{j-1}\big] &\le \sqrt{2} \times \mathbb{E}\big[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j} v_j\| \mid \mathcal{F}_{j-1}\big] && (4.45)\\
&= \sqrt{2} \times \mathbb{E}_{\Omega_j}\mathbb{E}_{v_j}\big[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j} v_j\|\big] && (4.46)\\
&\le \sigma\sqrt{2} \times \mathbb{E}_{\Omega_j}\big[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j}\|_F\big]. && (4.47)
\end{aligned}
\]

Notice that because each column of $Q_{j-1}$ is orthogonal to the corresponding column of $A$, the diagonal elements of $A^* Q_{j-1}$ are zero. Under this condition, we can invoke a decoupling lemma, given as Lemma E.1, to remove the first $P_{\Omega_j}$, giving

\[
\mathbb{E}_{\Omega_j}\big[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j}\|_F\big] \le 16\sqrt{\frac{k}{n}}\, \mathbb{E}_{\Omega_j}\big[\|A^* Q_{j-1} P_{\Omega_j}\|_F\big] \le 16\,\|A\|\sqrt{\frac{k}{n}}\, \mathbb{E}_{\Omega_j}\big[\|Q_{j-1} P_{\Omega_j}\|_F\big].
\]
Via incoherence, we can show that $\|A\| \le \sqrt{1 + n\mu(A)}$ (see (C.1)), and so

\[
\mathbb{E}_{\Omega_j}\big[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j}\|_F\big] \le c_3 \sqrt{k/n + k\mu(A)}\ \mathbb{E}_{\Omega_j}\big[\|Q_{j-1} P_{\Omega_j}\|_F\big], \tag{4.48}
\]

for appropriate $c_3$. Combining bounds, we have shown that

\[
\mathbb{E}\big[\langle Q_{j-1}, \lambda_j x_j^*\rangle \mid \mathcal{F}_{j-1}\big] \le \sigma\left(2 c_1 c_2\, k\mu(A) + \frac{c_3}{4}\sqrt{k/n + k\mu(A)} - \frac{1}{4\sqrt{\pi}}\right) \mathbb{E}_{\Omega_j}\big[\|Q_{j-1} P_{\Omega_j}\|_F\big].
\]

Assuming $k/n$ and $k\mu(A)$ are bounded by appropriately small constants, we have

\[
\mathbb{E}\big[\langle Q_{j-1}, \lambda_j x_j^*\rangle \mid \mathcal{F}_{j-1}\big] \le -c_4\, \sigma\, \mathbb{E}_{\Omega_j}\big[\|Q_{j-1} P_{\Omega_j}\|_F\big] \le -c_4\, \sigma\, (k/n)\, \|Q_{j-1}\|_F = -c_4 \sqrt{k/(np)}\, \|Q_{j-1}\|_F,
\]

where we have used Jensen's inequality and the facts that $\mathbb{E}_{\Omega_j}[P_{\Omega_j}] = (k/n) I$ and $\sigma = \sqrt{n/(kp)}$. This establishes the first part of (4.27).

(iii) Bounding $\|\lambda_j x_j^*\|$. We next bound $\mathbb{E}\big[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal{F}_{j-1}\big]$. We have already shown that under the conditions of the lemma, $\|\lambda_j\|_2 \le c_5 \sqrt{k} + 1/4 \le c_6 \sqrt{k}$. So,

\[
\|\Phi[\lambda_j x_j^*]\|_F^2 \le \|\lambda_j x_j^*\|_F^2 = \|\lambda_j\|^2 \|x_j\|^2 \le c_6\, k\, \|x_j\|^2. \tag{4.49}
\]

Since $\mathbb{E}[\|x_j\|_2^2] = n/p$, we have the simple bound

\[
\mathbb{E}\big[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal{F}_{j-1}\big] \le c_6\, k\, n/p. \tag{4.50}
\]

This establishes the second part of (4.27), completing the proof of Lemma 4.2.

5 Balancedness Property

In this section, we show that for any $(\Delta_A, \Delta_X)$ in the tangent space to $\mathcal{M}$ at $(A, X)$,

\[
\|\mathcal{P}_{\Omega^c} \Delta_X\|_F \ge \alpha\, \|\Delta_A\|_F \tag{5.1}
\]

for appropriate $\alpha > 0$. This property essentially says that if we locally perturb the basis, we are guaranteed to pay some penalty, in terms of the norm of $\mathcal{P}_{\Omega^c}\Delta_X$. Hence, it can be viewed as a step in the direction of Theorem 2.1. By itself, it is not sufficient to establish Theorem 2.1, however, since it does not rule out the possibility that as $A$ changes, $\|\mathcal{P}_\Omega \Delta_X\|_1$ might decrease faster than $\|\mathcal{P}_{\Omega^c}\Delta_X\|_1$ increases; for this purpose we need the golfing scheme of the previous section. On a technical level, however, (5.1) makes the golfing scheme possible, by allowing us to open a "hole" around the constraint $\Phi[\Lambda X^*] = 0$, and construct dual certificates $\Lambda$ that only satisfy $\Phi[\Lambda X^*] \approx 0$. More precisely, we next show:

Theorem 5.1. There exist numerical constants $C_1, \ldots, C_8 > 0$ such that the following occurs. If

\[
k \le C_1 \times \min\left\{n,\ \frac{1}{\mu(A)}\right\}, \tag{5.2}
\]

then whenever $p \ge C_2 n^2$, with probability at least

\[
1 - C_3 p^{-4} - C_4\, n \exp\left(-C_5 \frac{p}{n \log p}\right) - C_6\, n^2 \exp\left(-C_7 \frac{k^2 p}{n^2}\right), \tag{5.3}
\]

all pairs $(\Delta_A, \Delta_X)$ satisfying (3.1) obey the estimate

\[
\|\mathcal{P}_{\Omega^c} \Delta_X\|_F \ge C_8\, \|\Delta_A\|_F / \|A\|^2. \tag{5.4}
\]

Organization. The proof of Theorem 5.1 contains essentially two parts: algebraic manipulations that show that the desired property holds whenever the random matrix $X$ satisfies two particular properties, and then probabilistic reasoning to show that these properties hold with the stated probability. The first property, stated in Lemma 5.2, simply involves a bound on the extreme eigenvalues of $XX^*$. This lemma is proved in Appendix F. The second probabilistic property involves controlling the difference between a certain operator and its large sample limit. The quantities involved will arise naturally in the proof of Theorem 5.1, and the claim will be formally stated in Lemma 5.3 below. The proof of Lemma 5.3 is a bit technical, requiring us to apply the matrix Chernoff bound conditional on $\Omega$. This proof is given in Section 6.

5.1 Proof of Theorem 5.1

Before commencing the proof of Theorem 5.1, we introduce one additional definition.
Fix $0 < t < 1/2$, and let $E_{\mathrm{eig}}(t)$ denote the event

\[
E_{\mathrm{eig}}(t) \doteq \{\omega \mid \|XX^* - I\| < t\}. \tag{5.5}
\]

In particular, on $E_{\mathrm{eig}}$, $\|XX^*\| < 1 + t < 2$ and $\|(XX^*)^{-1}\| = (\lambda_{\min}(XX^*))^{-1} < 1/(1-t) < 2$. It should not be particularly surprising that this event is highly likely. The matrix $X$ has iid columns, and it is easy to see that $\mathbb{E}[XX^*] = I$. The following lemma shows that $XX^*$ is also close to $I$ in the operator norm, with high probability:

Lemma 5.2. Fix any $0 < t < 1/2$, and let $E_{\mathrm{eig}}(t)$ denote the event that the following bound holds:

\[
\|XX^* - I\| < t. \tag{5.6}
\]

Then there exist numerical constants $C_1, C_2, C_3$, all strictly positive, such that for all $p \ge C_1 (n/t)^{1/4}$,

\[
\mathbb{P}[E_{\mathrm{eig}}(t)] \ge 1 - C_2\, n \exp\left(-C_3 \frac{t^2 p}{n \log p}\right) - p^{-7}. \tag{5.7}
\]

Lemma 5.2 is essentially a consequence of the matrix Chernoff bound of [Tro10]. Its proof is a bit technical, and so is delayed to Section F of the appendix. For now, we take this result as given and commence the proof of Theorem 5.1.

Proof of Theorem 5.1. On the event $E_{\mathrm{eig}}$, $XX^*$ is invertible, and any pair $(\Delta_A, \Delta_X)$ satisfying (3.1) also satisfies

\[
\Delta_A = -A\, \Delta_X X^* (XX^*)^{-1}. \tag{5.8}
\]

Hence, using that $\|X^*(XX^*)^{-1}\| = 1/\sqrt{\lambda_{\min}(XX^*)}$ (this can be shown via the singular value decomposition of $X$) and that for any matrices $P, Q, R$, $\|PQR\|_F \le \|P\|\,\|R\|\,\|Q\|_F$, on $E_{\mathrm{eig}}(1/2)$ we have

\[
\|\Delta_A\|_F \le \frac{\|A\|}{\sqrt{\lambda_{\min}(XX^*)}}\, \|\Delta_X\|_F \le \sqrt{2}\, \|A\|\, \|\Delta_X\|_F. \tag{5.9}
\]

We next show that for any pair $(\Delta_A, \Delta_X)$ satisfying (3.1), $\Delta_X$ cannot be too concentrated on $\Omega$. More precisely, we will show that there exists $\alpha' < \infty$ such that for any such $\Delta_X$,

\[
\|\mathcal{P}_\Omega[\Delta_X]\|_F \le \alpha'\, \|\mathcal{P}_{\Omega^c}[\Delta_X]\|_F. \tag{5.10}
\]

On $E_{\mathrm{eig}}$, the inverse in (5.8) is justified, and (5.8) holds. We can plug this relationship into the tangent space constraint $\Delta_A X + A\Delta_X = 0$, giving

\[
0 = \Delta_A X + A \Delta_X = -A \Delta_X X^*(XX^*)^{-1} X + A \Delta_X = A \Delta_X \left(I - X^*(XX^*)^{-1}X\right).
\]

Above, $P_X \doteq X^*(XX^*)^{-1}X$ is the projection matrix onto the range of $X^*$. We have one further constraint, $A_i^* \Delta_A e_i = 0$ for all $i$. We introduce a more concise notation for this constraint, by letting $\mathcal{C}_A : \mathbb{R}^n \to \mathbb{R}^{m \times n}$ via

\[
\mathcal{C}_A[z] = A \operatorname{diag}(z). \tag{5.11}
\]

For $U = [u_1 \mid u_2 \mid \cdots \mid u_n] \in \mathbb{R}^{m \times n}$, the action of the adjoint of $\mathcal{C}_A$ is given by

\[
\mathcal{C}_A^*[U] = [\langle A_1, u_1 \rangle, \ldots, \langle A_n, u_n \rangle]^* \in \mathbb{R}^n. \tag{5.12}
\]

Hence, our second constraint can be expressed concisely via $\mathcal{C}_A^*[\Delta_A] = 0 \in \mathbb{R}^n$. On $E_{\mathrm{eig}}$, any $\Delta_X$ participating in a pair $(\Delta_A, \Delta_X) \in T_{x_\star}\mathcal{M}$ must satisfy

\[
A \Delta_X (I - P_X) = 0 \quad \text{and} \quad \mathcal{C}_A^*\big[A \Delta_X X^* (XX^*)^{-1}\big] = 0. \tag{5.13}
\]

It is convenient to temporarily express the constraint (5.13) in vector form, as a constraint on $\delta_x \doteq \operatorname{vec}[\Delta_X] \in \mathbb{R}^{np}$. In vector notation, (5.13) is equivalent to $M \delta_x = 0$, with

\[
M \doteq \begin{bmatrix} (I - P_X) \otimes A \\ C_A^*\big((XX^*)^{-1} X \otimes A\big) \end{bmatrix} \in \mathbb{R}^{(mp + n) \times np}. \tag{5.14}
\]

In forming $M$, we have used the familiar identity $\operatorname{vec}[QRS] = (S^* \otimes Q)\operatorname{vec}[R]$, for matrices $Q$, $R$, and $S$ of compatible size. We have used $C_A$ to denote the matrix version of the operator $\mathcal{C}_A$, uniquely defined via[6]

\[
\operatorname{vec}[\mathcal{C}_A[z]] = C_A z \quad \forall z \in \mathbb{R}^n. \tag{5.15}
\]

It will be easier to work with a symmetric variant of the equation $M\delta_x = 0$. Set

\[
T \doteq M^* M = (I - P_X) \otimes A^* A + \big(X^*(XX^*)^{-1} \otimes A^*\big)\, C_A C_A^*\, \big((XX^*)^{-1}X \otimes A\big); \tag{5.16}
\]

then

\[
M\delta_x = 0 \iff T\delta_x = 0. \tag{5.17}
\]
Splitting $\delta_x$ as $\delta_x = P_\Omega \delta_x + P_{\Omega^c}\delta_x$, and multiplying (5.17) on the left by $P_\Omega$, gives

\[
P_\Omega T P_\Omega \delta_x = -P_\Omega T P_{\Omega^c} \delta_x, \tag{5.18}
\]
or
\[
[P_\Omega T P_\Omega](P_\Omega \delta_x) = -[P_\Omega T P_{\Omega^c}](P_{\Omega^c}\delta_x). \tag{5.19}
\]

Now, although the matrix $P_\Omega T P_\Omega$ is rank-deficient, as we will see, its nullspace does not contain any vectors $z$ supported on $\Omega$. More quantitatively, let $S_\Omega \subset \mathbb{R}^{np}$ denote the subspace of vectors whose support is contained in $\Omega$ (i.e., the solution space of $P_\Omega z = z$), and define

\[
\xi \doteq \inf_{z \in S_\Omega \setminus \{0\}} \frac{\|P_\Omega T P_\Omega z\|}{\|z\|}. \tag{5.20}
\]

Then if $\xi > 0$, by (5.19) we have

\[
\|P_\Omega \delta_x\| \le \xi^{-1}\|P_\Omega T P_\Omega \delta_x\| = \xi^{-1}\|[P_\Omega T P_{\Omega^c}]P_{\Omega^c}\delta_x\| \le \xi^{-1}\|P_\Omega T P_{\Omega^c}\|\, \|P_{\Omega^c}\delta_x\|.
\]

A calculation[7] shows that $\|P_\Omega T P_{\Omega^c}\| \le C\|A\|$, and hence, thus far we have shown

\[
\|\Delta_A\|_F \le \sqrt{2}\|A\|\,\|\Delta_X\|_F \le \sqrt{2}\|A\|\big(\|\mathcal{P}_\Omega \Delta_X\|_F + \|\mathcal{P}_{\Omega^c}\Delta_X\|_F\big) \le \sqrt{2}\|A\|\big(1 + C\xi^{-1}\|A\|\big)\|\mathcal{P}_{\Omega^c}\Delta_X\|_F. \tag{5.21}
\]

Our only remaining tasks are to lower bound $\xi$ to complete the bound on $\alpha$, and then verify that the failure probability is indeed small. We carry out these tasks below, with some technical details associated with bounding $\xi$ delayed to Section 6.

The expression for $T$ in (5.16) is quite complicated. Notice, however, that as $p \to \infty$, $XX^* \to I$ almost surely. If we can replace $(XX^*)^{-1}$ with $I$ in (5.16), the expression will simplify significantly. We introduce $\hat{T}$, this simplified approximation, given by

\[
\hat{T} \doteq (I - X^*X) \otimes A^*A + (X^* \otimes A^*)\, C_A C_A^*\, (X \otimes A) = I \otimes A^*A - (X^* \otimes A^*)(I - C_A C_A^*)(X \otimes A). \tag{5.22}
\]

The matrix $I \otimes A^*A$ is quite simple: for a matrix $Z$, $(I \otimes A^*A)\operatorname{vec}[Z] = \operatorname{vec}[(A^*A)Z]$, and so $A^*A$ simply acts columnwise on $Z$. We will see that if the columns of $Z$ are appropriately sparse, then because $A$ is incoherent, $A^*A \approx I$ will approximately preserve their norms. Hence, the restricted singular value $\xi$ associated with the matrix $I \otimes A^*A$ is well-behaved. We therefore let $R$ denote the nuisance term in (5.22),

\[
R \doteq (X^* \otimes A^*)(I - C_A C_A^*)(X \otimes A), \tag{5.23}
\]

so that we have

\[
\hat{T} = I \otimes A^*A - R, \quad \text{and} \quad T = I \otimes A^*A - R + (T - \hat{T}). \tag{5.24}
\]

In terms of these variables,

\[
\begin{aligned}
\xi &= \inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A - R + T - \hat{T})P_\Omega z\|}{\|z\|}\\
&\ge \inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A)P_\Omega z\|}{\|z\|} - \sup_{z \neq 0}\frac{\|P_\Omega R P_\Omega z\|}{\|z\|} - \sup_{z \neq 0}\frac{\|P_\Omega(T - \hat{T})P_\Omega z\|}{\|z\|}\\
&= \inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A)P_\Omega z\|}{\|z\|} - \|P_\Omega R P_\Omega\| - \|P_\Omega(T - \hat{T})P_\Omega\|. \qquad (5.25)
\end{aligned}
\]

The first and third terms above require relatively little manipulation to control. In particular, in paragraph (i) below, we will show that

\[
\inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A)P_\Omega z\|}{\|z\|} \ge 1 - k\mu(A). \tag{5.26}
\]

In paragraph (ii) below, we will show that there is a constant $t_\star > 0$ such that on $E_{\mathrm{eig}}(t_\star)$,

\[
\|P_\Omega(T - \hat{T})P_\Omega\| \le 1/8. \tag{5.27}
\]

[Footnote 6: In particular, it is not difficult to see that $C_A \in \mathbb{R}^{mn \times n}$ is a block diagonal matrix whose blocks are the columns of $A$.]

[Footnote 7: Notice that $\|P_\Omega T P_{\Omega^c}\| \le \|P_\Omega T\|\,\|P_{\Omega^c}\| = \|P_\Omega T\|$. Using (5.16), write
\[
\begin{aligned}
\|P_\Omega T\| &\le \|P_\Omega(I \otimes A^*)\|\, \big\|(I - P_X) \otimes A + \big(X^*(XX^*)^{-1} \otimes I\big) C_A C_A^* \big((XX^*)^{-1}X \otimes A\big)\big\|\\
&\le \|P_\Omega(I \otimes A^*)\|\, \big\|(I - P_X) \otimes I + \big(X^*(XX^*)^{-1} \otimes I\big) C_A C_A^* \big((XX^*)^{-1}X \otimes I\big)\big\|\, \|A\|\\
&\le \|P_\Omega(I \otimes A^*)\|\, \big(1 + 1/\lambda_{\min}(XX^*)\big)\, \|A\|.
\end{aligned}
\]
Now, $P_\Omega(I \otimes A^*)$ is a block-diagonal matrix, with blocks given by $A_{\Omega_1}^*, \ldots, A_{\Omega_p}^*$. By (C.2), the operator norm of each of these blocks is bounded by a constant, say, $c_1$. Hence, $\|P_\Omega(I \otimes A^*)\|$ is bounded by $c_1$ as well. Similarly, on $E_{\mathrm{eig}}$, $\lambda_{\min}^{-1}(XX^*)$ is also bounded by a constant, giving the desired expression.]
The analysis of $P_\Omega R P_\Omega$ is a bit trickier, requiring both additional algebraic manipulations and additional probability estimates. For now, we will define an event $E_R$, on which the norm of this term is small:

\[
E_R \doteq \{\omega \mid \|P_\Omega R P_\Omega\| \le 1/8\}. \tag{5.28}
\]

In Section 6, we prove the following lemma, which shows that $E_R$ is indeed likely:

Lemma 5.3. Let $E_R$ be the event that $\|P_\Omega R P_\Omega\| \le 1/8$. Then there exist positive numerical constants $C_1, \ldots, C_6$ such that whenever

\[
k \le \min\left\{C_1 n,\ \frac{C_2}{\mu(A)}\right\} \tag{5.29}
\]

and $p > C_3 n^2$, we have

\[
\mathbb{P}[E_R] \ge 1 - C_4 p^{-4} - C_5 n^2 \exp\left(-C_6 k^2 p / n^2\right). \tag{5.30}
\]

Plugging (5.26), (5.27) and (5.28) into (5.25), we obtain that on $E_{\mathrm{eig}}(t_\star) \cap E_R$,

\[
\xi \ge \frac{3}{4} - k\mu(A). \tag{5.31}
\]

So, assuming $C_1$ in the statement of Theorem 5.1 is such that $k\mu(A) < 1/2$, we have $\xi > 1/4$. Plugging this value for $\xi$ into (5.21), we finally obtain that on $E_{\mathrm{eig}}(t_\star) \cap E_R$, for any pair $(\Delta_A, \Delta_X)$ in the tangent space,

\[
\|\Delta_A\|_F \le \sqrt{2}\,\|A\|\,(1 + C'\|A\|)\,\|\mathcal{P}_{\Omega^c}[\Delta_X]\|_F \le C''\,\|A\|^2\,\|\mathcal{P}_{\Omega^c}[\Delta_X]\|_F. \tag{5.32}
\]

Hence, the bound claimed in Theorem 5.1 holds with probability at least $1 - \mathbb{P}[E_{\mathrm{eig}}(t_\star)^c] - \mathbb{P}[E_R^c]$. When the constants $C_1$ and $C_2$ in Theorem 5.1 are chosen appropriately, the conditions of Lemma 5.2 of Section F and Lemma 5.3 of Section 6 are verified. From Lemma 5.2, $\mathbb{P}[E_{\mathrm{eig}}(t_\star)^c] < c_1 n \exp(-c_2\, p/(n\log p)) + p^{-7}$. In Lemma 5.3 we show that $\mathbb{P}[E_R^c] < c_3 p^{-4} + c_4 n^2 \exp(-c_5 k^2 p/n^2)$. Combining the probabilities and consolidating polynomial terms establishes the desired result. It remains only to show that the two bounds in (5.26) and (5.27) indeed hold.

(i) Establishing (5.26). For this term, it is more convenient to work with matrices and the Frobenius norm, rather than vectors and the $\ell^2$ norm. Fix any $Z \in \mathbb{R}^{n \times p}$ with $z \doteq \operatorname{vec}[Z] \in S_\Omega$ (i.e., $Z$ has support contained in $\Omega$). Then

\[
\|P_\Omega(I \otimes A^*A)P_\Omega z\|^2 = \|\mathcal{P}_\Omega[A^*A\, \mathcal{P}_\Omega[Z]]\|_F^2 = \sum_{j=1}^{p} \big\|A_{\Omega_j}^* A_{\Omega_j} Z(\Omega_j, j)\big\|_2^2 \ge \min_j \sigma_{\min}^2\big(A_{\Omega_j}^* A_{\Omega_j}\big) \sum_j \|Z(\Omega_j, j)\|_2^2 \ge \|Z\|_F^2 (1 - k\mu(A))^2, \tag{5.33}
\]

where in the final step we have used that $Z$ is supported on $\Omega$, together with the bound $\sigma_{\min}(A_{\Omega_j}^* A_{\Omega_j}) > 1 - k\mu(A)$ shown in (C.3) of Appendix C. The bound (5.33) holds for all $z$ supported on $\Omega$, and so (5.26) holds.

(ii) Establishing (5.27). For the term $P_\Omega(T - \hat{T})P_\Omega$, write $\Xi \doteq (XX^*)^{-1} - I$, and notice that $T - \hat{T}$ can be written as

\[
\begin{aligned}
T - \hat{T} &= X^*X \otimes A^*A - X^*(XX^*)^{-1}X \otimes A^*A + \big(X^*(XX^*)^{-1} \otimes A^*\big) C_A C_A^* \big((XX^*)^{-1}X \otimes A\big) - (X^* \otimes A^*) C_A C_A^* (X \otimes A)\\
&= (X^* \otimes A^*)(-\Xi \otimes I)(X \otimes A) + (X^* \otimes A^*)(\Xi \otimes I)\, C_A C_A^*\, \big((XX^*)^{-1}X \otimes A\big) + (X^* \otimes A^*)\, C_A C_A^*\, (\Xi \otimes I)(X \otimes A)\\
&= (X^* \otimes A^*)\Big[(C_A C_A^* - I)(\Xi \otimes I) + (\Xi \otimes I)\, C_A C_A^*\, \big((XX^*)^{-1} \otimes I\big)\Big](X \otimes A).
\end{aligned}
\]
Therefore, using $\|C_A C_A^* - I\| = 1$ and $\|C_A\| = 1$, we have the estimate

$\|P_\Omega(T-\hat T)P_\Omega\| \le \|P_\Omega(X^*\otimes A^*)\|^2\big(\|\Xi\| + \|(XX^*)^{-1}\|\,\|\Xi\|\big) \le \|P_\Omega(I\otimes A^*)\|^2\,\|X\|^2\big(1 + \|(XX^*)^{-1}\|\big)\|\Xi\| \le 6\,\|P_\Omega(I\otimes A^*)\|^2\,\|\Xi\|$,   (5.34)

where the last bound holds on $E_{\mathrm{eig}}(t)$ for small enough $t$ (e.g., $t < 1/2$ is sufficient). From the incoherence of $A$ (i.e., (C.2)),

$\|P_\Omega(I\otimes A^*)\|^2 = \max_j \|A_{\Omega_j}\|^2 \le 1 + k\mu(A) < 2$.   (5.35)

Hence, on the event $E_{\mathrm{eig}}$, $\|P_\Omega(T-\hat T)P_\Omega\| \le 12\|\Xi\|$. Finally, on $E_{\mathrm{eig}}(t)$, $\|\Xi\| \le t/(1-t)$; choosing $t$ small enough guarantees that on $E_{\mathrm{eig}}(t)$, $\|P_\Omega(T-\hat T)P_\Omega\| \le 1/8$ as desired (in particular, $t < 1/97$ suffices). Thus, (5.26) and (5.27) hold, and Theorem 5.1 is established.

6 Controlling the residual $P_\Omega R P_\Omega$

In this section, we estimate the norm of the residual $P_\Omega R P_\Omega$, where $R$ was defined in (5.23), and show that with high probability it is bounded by a small constant. To establish this result, in Section 6.1 below we first develop a more convenient expression for $P_\Omega R P_\Omega$ as a sum of random semidefinite matrices that are independent conditioned on $\Omega$.

6.1 Proof of Lemma 5.3

Proof. We begin by introducing an additional bit of notation. For $X \in \mathbb{R}^{n\times p}$, we write

$x^i = e_i^* X \in \mathbb{R}^{1\times p}$   (6.1)

for the $i$-th row of $X$, and

$x_j = X e_j \in \mathbb{R}^n$   (6.2)

for the $j$-th column of $X$. Similarly, we let

$\Omega^i = \{j \mid (i,j)\in\Omega\} \subseteq [p]$,   (6.3)

and

$\Omega_j \doteq \{i \mid (i,j)\in\Omega\} \subset [n]$.   (6.4)

Recalling the definition (5.23) and using the familiar identity $(P\otimes Q) = (P\otimes I)(I\otimes Q)$, we have

$R = (X^*\otimes I)(I\otimes A^*)(I - C_A C_A^*)(I\otimes A)(X\otimes I)$.   (6.5)

The product of the middle three terms is a block diagonal matrix

$(I\otimes A^*)(I - C_A C_A^*)(I\otimes A) = \mathrm{blockdiag}\big(A^*(I - A_1A_1^*)A,\ \ldots,\ A^*(I - A_nA_n^*)A\big)$.   (6.6)

For compactness, let $P_i = I - A_iA_i^*$; notice that this is the projection matrix onto the orthogonal complement of $A_i$. Then, expanding the product in (6.5) more explicitly, we find that $R$ is the $p\times p$ block matrix whose $(j, j')$ block is $\sum_{b=1}^n X_{b,j}X_{b,j'}\,A^*P_bA$:

$R = \begin{bmatrix} \sum_{b=1}^n X_{b,1}X_{b,1}\,A^*P_bA & \cdots & \sum_{b=1}^n X_{b,1}X_{b,p}\,A^*P_bA \\ \vdots & \ddots & \vdots \\ \sum_{b=1}^n X_{b,p}X_{b,1}\,A^*P_bA & \cdots & \sum_{b=1}^n X_{b,p}X_{b,p}\,A^*P_bA \end{bmatrix}$.   (6.7)

Breaking this sum up into $n$ terms (indexed by common $b$), we have

$P_\Omega R P_\Omega = \sum_{b=1}^n P_\Omega\big(x^{b*}x^b \otimes A^*P_bA\big)P_\Omega$.   (6.8)

If we set

$\Psi_i \doteq P_\Omega\big(x^{i*}x^i \otimes A^*P_iA\big)P_\Omega = P_\Omega\big(P_{\Omega^i}v^{i*}v^iP_{\Omega^i} \otimes A^*P_iA\big)P_\Omega$,   (6.9)

then we have

$P_\Omega R P_\Omega = \sum_{i=1}^n \Psi_i$.   (6.10)

This is a sum of random positive semidefinite matrices. Moreover, from (6.9), we observe that conditioned on $\Omega$, the $\Psi_i$ are fixed functions of independent random vectors $v^i$, and hence the $\Psi_i$ are conditionally independent. We would like to apply a matrix tail bound to this sum, conditioned on $\Omega$. To do this, we first need to understand how the support $\Omega$ affects the expected size of $\Psi_i$.

With high probability, the support $\Omega$ is quite regular. If we fix any $i \in [n]$, the expected size of $\Omega^i$ is simply $pk/n$. In fact, because the $\Omega_j$ are independent (and hence the events $j \in \Omega^i$ are independent), $|\Omega^i|$ concentrates near this value. Moreover, if $i$ and $i'$ are distinct, $|\Omega^i \cap \Omega^{i'}|$ concentrates about its expectation, which is bounded by $k^2p/n^2$. We define a set of "desirable" supports, for which these quantities do not greatly exceed their expectations:

$\mathcal{O} \doteq \Big\{\Omega \subset [n]\times[p] \ \Big|\ \max_{i=1,\ldots,n}|\Omega^i| \le \tfrac{3pk}{2n},\ \ \max_{i\ne i'}|\Omega^i\cap\Omega^{i'}| \le \tfrac{3pk^2}{2n^2}\Big\}$.   (6.11)
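Before proceeding, the regularity conditions defining $\mathcal{O}$ can be checked empirically. The sketch below is not part of the paper's argument: it assumes the support model in which each column support $\Omega_j$ is an independent, uniformly random $k$-subset of $[n]$, and compares the worst row occupancy $\max_i|\Omega^i|$ and pairwise overlap $\max_{i\ne i'}|\Omega^i\cap\Omega^{i'}|$ with the thresholds in (6.11); the parameter values are illustrative only.

```python
# Illustrative simulation (not from the paper): draw a support Omega under the
# model in which each column support Omega_j is an independent, uniformly random
# k-subset of [n], and compare the row occupancies |Omega^i| and the pairwise
# overlaps |Omega^i \cap Omega^{i'}| with the thresholds in (6.11).
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 20000, 5

mask = np.zeros((n, p), dtype=bool)                 # mask[i, j] = True iff (i, j) in Omega
for j in range(p):
    mask[rng.choice(n, size=k, replace=False), j] = True

row_sizes = mask.sum(axis=1)                        # |Omega^i|
overlaps = mask.astype(int) @ mask.astype(int).T    # |Omega^i \cap Omega^{i'}|
np.fill_diagonal(overlaps, 0)

print("max_i |Omega^i|     =", row_sizes.max(), "  threshold 3pk/2n     =", 3 * p * k / (2 * n))
print("max_{i!=i'} overlap =", overlaps.max(),  "  threshold 3pk^2/2n^2 =", 3 * p * k**2 / (2 * n**2))
```

For $p$ large relative to $n^2/k^2$, both maxima sit well below their thresholds, which is exactly what Lemma 6.1 (stated next) quantifies.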
It is not difficult to show that the event $\Omega \in \mathcal{O}$ is overwhelmingly likely:

Lemma 6.1. With overwhelming probability, $\Omega \in \mathcal{O}$:

$\mathbb{P}[\Omega \in \mathcal{O}] \ge 1 - n^2\exp\Big(-\frac{pk^2}{10n^2}\Big)$.   (6.12)

We prove Lemma 6.1 in Section 6.2 below. Now, when $\Omega \in \mathcal{O}$, the norms of the rows of $X$ also concentrate about their conditional expectations. We define $n$ events, on which the rows $x^i$ are not too large in norm, and also do not concentrate too strongly on the intersection $\Omega^{i'}\cap\Omega^i$ for any $i' \ne i$:

$E_i \doteq \Big\{\omega \ \Big|\ \max_{a\ne i}\|x^iP_{\Omega^a}\| \le 2\sqrt{k/n},\ \text{and}\ \|x^i\| \le 2\Big\}$.   (6.13)

We further set

$E_X \doteq \bigcap_{i=1}^n E_i$.   (6.14)

It is not hard to show that $E_X$ is overwhelmingly likely:

Lemma 6.2. For any $\Omega \in \mathcal{O}$,

$\mathbb{P}[E_X \mid \Omega] \ge 1 - n^2\exp\Big(-\frac{k^2p}{4n^2}\Big)$.   (6.15)

We prove Lemma 6.2 in Section 6.3 below. This lemma is useful because whenever $E_i$ occurs, $\Psi_i$ is indeed small in norm:

Lemma 6.3. Let $E_i$ be the event defined in (6.13), and let $\Psi_i$ denote the $i$-th residual term:

$\Psi_i = P_\Omega\big(x^{i*}x^i\otimes A^*P_iA\big)P_\Omega$.   (6.16)

Then on event $E_i$, we have

$\|\Psi_i\| \le 4k/n + 24k\mu(A)$.   (6.17)

We prove Lemma 6.3 in Section 6.4 below. For now, however, we show how the previous three lemmas, together with a matrix Chernoff bound, imply the desired result.

For convenience, let $\Psi \doteq P_\Omega R P_\Omega = \sum_i \Psi_i$. Set

$\bar\Psi_i = \Psi_i \times \mathbf{1}_{E_i}$,   (6.18)

where $\mathbf{1}_{E_i}$ denotes the indicator random variable for the event $E_i$. By Lemma 6.3, $\bar\Psi_i$ always satisfies

$\|\bar\Psi_i\| \le 4k/n + 24k\mu(A) \doteq B$.   (6.19)

Conditioned on $\Omega$, each $\bar\Psi_i$ is a function of $v^i$ only, and hence the $\bar\Psi_i$ are independent conditioned on $\Omega$. We apply a sequence of manipulations to reduce the problem of bounding the probability that $\|\Psi\|$ exceeds $1/8$ to the problem of bounding the probability that $\|\bar\Psi\|$ exceeds $1/8$:

$\mathbb{P}[\|\Psi\| \ge 1/8] = \mathbb{P}[\|\Psi\| \ge 1/8 \mid \Omega\in\mathcal{O}]\,\mathbb{P}[\Omega\in\mathcal{O}] + \mathbb{P}[\|\Psi\| \ge 1/8 \mid \Omega\in\mathcal{O}^c]\,\mathbb{P}[\Omega\in\mathcal{O}^c]$
$\le \mathbb{P}[\|\Psi\| \ge 1/8 \mid \Omega\in\mathcal{O}] + \mathbb{P}[\Omega\in\mathcal{O}^c]$
$\le \max_{\Omega'\in\mathcal{O}} \mathbb{P}[\|\Psi\| \ge 1/8 \mid \Omega'] + \mathbb{P}[\Omega\in\mathcal{O}^c]$
$\le \max_{\Omega'\in\mathcal{O}} \big(\mathbb{P}[\|\bar\Psi\| \ge 1/8 \mid \Omega'] + \mathbb{P}[\Psi \ne \bar\Psi \mid \Omega']\big) + \mathbb{P}[\Omega\in\mathcal{O}^c]$
$\le \max_{\Omega'\in\mathcal{O}} \big(\mathbb{P}[\|\bar\Psi\| \ge 1/8 \mid \Omega'] + \mathbb{P}[\cup_i E_i^c \mid \Omega']\big) + \mathbb{P}[\Omega\in\mathcal{O}^c]$   (6.20)
$= \max_{\Omega'\in\mathcal{O}} \big(\mathbb{P}[\|\bar\Psi\| \ge 1/8 \mid \Omega'] + \mathbb{P}[E_X^c \mid \Omega']\big) + \mathbb{P}[\Omega\in\mathcal{O}^c]$.   (6.21)

In (6.20), we have used that by definition, on $E_i$, $\bar\Psi_i = \Psi_i\mathbf{1}_{E_i}$ is equal to $\Psi_i$, while in the following line we have used the definition of $E_X = \cap_i E_i$. Now, Lemma 6.2 shows that conditioned on any $\Omega'\in\mathcal{O}$, $E_X^c$ is overwhelmingly unlikely, while Lemma 6.1 shows that the event $\Omega\in\mathcal{O}^c$ is overwhelmingly unlikely. Plugging in the bounds from those two lemmas, we have that

$\mathbb{P}[\|\Psi\| \ge 1/8] \le \max_{\Omega'\in\mathcal{O}} \mathbb{P}[\|\bar\Psi\| \ge 1/8 \mid \Omega'] + n^2\exp\Big(-\frac{k^2p}{4n^2}\Big) + n^2\exp\Big(-\frac{k^2p}{10n^2}\Big) \le \max_{\Omega'\in\mathcal{O}} \mathbb{P}[\|\bar\Psi\| \ge 1/8 \mid \Omega'] + 2n^2\exp\Big(-\frac{k^2p}{10n^2}\Big)$.   (6.22)

We complete the proof by applying the matrix Chernoff bound (B.4) to the first term. Fix any $\Omega'\in\mathcal{O}$. We need to estimate $\mu_{\max} = \|\mathbb{E}[\bar\Psi \mid \Omega']\|$. Since $0 \preceq \bar\Psi \preceq \Psi$ always,

$\mu_{\max} = \|\mathbb{E}[\bar\Psi \mid \Omega']\| \le \|\mathbb{E}[\Psi \mid \Omega']\|$.   (6.23)

This conditional expectation can be easily evaluated using the expression for $\Psi_i$ in (6.9) and the fact that $\mathbb{E}[v^{i*}v^i] = (n/kp)I$:

$\mathbb{E}[\Psi\mid\Omega] = \mathbb{E}_V\Big[\sum_{i=1}^n P_\Omega\big(P_{\Omega^i}v^{i*}v^iP_{\Omega^i}\otimes A^*P_iA\big)P_\Omega\Big] = \frac{n}{kp}\sum_{i=1}^n P_\Omega\big(P_{\Omega^i}\otimes A^*P_iA\big)P_\Omega$.

Notice that for each $i$, $P_{\Omega^i}\otimes A^*P_iA \preceq P_{\Omega^i}\otimes A^*A$, and so

$\mathbb{E}[\Psi\mid\Omega] \preceq \frac{n}{kp}\sum_{i=1}^n P_\Omega\big(P_{\Omega^i}\otimes A^*A\big)P_\Omega = \frac{n}{kp}\,P_\Omega\Big(\sum_{i=1}^n P_{\Omega^i}\otimes A^*A\Big)P_\Omega$.   (6.24)
The matrix $\sum_i P_{\Omega^i}$ is diagonal; its $(j,j)$ element simply counts the number of indices $i$ with $(i,j)\in\Omega$, i.e., the number of nonzero entries in the $j$-th column. This number is a constant $k$, so $\sum_i P_{\Omega^i} = kI$, and

$\mathbb{E}[\Psi\mid\Omega] \preceq \frac{n}{p}\,P_\Omega(I\otimes A^*A)P_\Omega$.   (6.25)

The matrix $P_\Omega(I\otimes A^*A)P_\Omega$ is block-diagonal, with $j$-th block $P_{\Omega_j}A^*AP_{\Omega_j}$. This block has norm bounded by $\|A_{\Omega_j}\|^2$. Using a calculation given in (C.2), this is in turn bounded by $3/2$, provided $k\mu(A) < 1/2$. Hence, we have

$\mu_{\max} \le 3n/2p$.   (6.26)

We apply the matrix Chernoff bound (B.4) with $t\mu_{\max} = 1/8$, and hence $t \ge p/12n$, which gives

$\mathbb{P}[\|\bar\Psi\| \ge 1/8 \mid \Omega] \le np\Big(\frac{12en}{p}\Big)^{1/8B}$,   (6.27)

where we recall that $B$ is the bound on the norm of the summands $\bar\Psi_i$. By choosing the constant $C_1 > 0$ in the statement of Lemma 5.3 appropriately, we can make the exponent $\nu = 1/8B$ as large as desired; the probability that $\|\bar\Psi\|$ exceeds $1/8$ is bounded as

$\mathbb{P}[\|\bar\Psi\| \ge 1/8 \mid \Omega] \le C(\nu)\,n^{1+\nu}p^{1-\nu}$.   (6.28)

Assuming $p \ge Cn^2$, and by appropriate choice of $\nu$, we can make the right hand side smaller than $C'p^{-4}$ (here, the exponent 4 is clearly arbitrary). Plugging into (6.22) completes the proof.

6.2 Proof of Lemma 6.1

Proof. Notice that $|\Omega^i| = \sum_{j=1}^p \mathbf{1}_{(i,j)\in\Omega}$ is a sum of $p$ independent Bernoulli$(k/n)$ random variables. Write

$Z_j = \mathbf{1}_{(i,j)\in\Omega} - \mathbb{E}[\mathbf{1}_{(i,j)\in\Omega}] = \mathbf{1}_{(i,j)\in\Omega} - k/n$.

Then $|\Omega^i| = pk/n + \sum_{j=1}^p Z_j$. The $Z_j$ are independent, zero mean, with magnitude bounded by 1 and variance

$\mathbb{E}[Z_j^2] = \mathrm{Var}[\mathbf{1}_{(i,j)\in\Omega}] \le \mathbb{E}\big[\mathbf{1}_{(i,j)\in\Omega}^2\big] = \frac{k}{n}$.   (6.29)

By Bernstein's inequality, for any $\epsilon > 0$,

$\mathbb{P}\Big[\sum_j Z_j > \epsilon p\Big] \le \exp\Big(-\frac{p\epsilon^2}{2\mathbb{E}[Z_j^2] + 2\epsilon/3}\Big) \le \exp\Big(-\frac{p\epsilon^2}{2k/n + 2\epsilon/3}\Big)$.   (6.30)

Setting $\epsilon = k/2n$ and then taking a union bound over $i\in[n]$ gives

$\mathbb{P}\Big[\max_i |\Omega^i| \ge \frac{3}{2}\cdot\frac{kp}{n}\Big] \le n\exp\Big(-\frac{pk}{10n}\Big)$.   (6.31)

Above, we have used $8 + 4/3 < 10$ to simplify the constant. Similarly, notice that

$|\Omega^i\cap\Omega^{i'}| = \sum_{j=1}^p \mathbf{1}_{(i,j)\in\Omega}\mathbf{1}_{(i',j)\in\Omega} \doteq \sum_j H_j$

is a sum of independent Bernoulli random variables $H_j$, which take on value one with probability

$\mathbb{E}[H_j] = \mathbb{P}[H_j = 1] = \binom{n-2}{k-2}\Big/\binom{n}{k} \le \frac{k^2}{n^2}$.   (6.32)

Set $Z_j = H_j - \mathbb{E}[H_j]$. Then the bound $|Z_j| \le 1$ always holds, and furthermore

$\mathbb{E}[Z_j^2] = \mathrm{Var}[H_j] \le \mathbb{E}[H_j^2] \le \frac{k^2}{n^2}$.   (6.33)

With these definitions,

$|\Omega^i\cap\Omega^{i'}| \le \frac{pk^2}{n^2} + \sum_j Z_j$.

Again applying Bernstein's inequality to $\sum_j Z_j$, setting $\epsilon = k^2/2n^2$ and taking a union bound over the $n^2$ choices of distinct $(i,i')$, we have

$\mathbb{P}\Big[\max_{i\ne i'} |\Omega^i\cap\Omega^{i'}| \ge \frac{3}{2}\cdot\frac{k^2p}{n^2}\Big] \le n^2\exp\Big(-\frac{pk^2}{10n^2}\Big)$.   (6.34)

Summing the failure probabilities in (6.31) and (6.34), and using that $n\exp(-pk/10n) \le n^2\exp(-pk^2/10n^2)$, completes the proof.

6.3 Proof of Lemma 6.2

Proof. This proof is an exercise in Gaussian measure concentration. Equation (2.35) of [Led01] implies that if $v$ is an iid sequence of $N(0,\sigma^2)$ random variables, and $f$ is a positively homogeneous, 1-Lipschitz function, then

$\mathbb{P}\big[f(v) \ge \mathbb{E}[f(v)] + t\big] \le \exp\Big(-\frac{t^2}{2\sigma^2}\Big)$.   (6.35)

Now, with $\Omega$ fixed, $\|x^i\| = \|v^iP_{\Omega^i}\| \doteq f(v^i)$ is a 1-Lipschitz function of the iid $N(0, n/kp)$ vector $v^i$. Since for any $\Omega\in\mathcal{O}$, $|\Omega^i| \le 3pk/2n$,

$\mathbb{E}[\|x^i\| \mid \Omega] \le \sqrt{\mathbb{E}[\|x^i\|^2\mid\Omega]} = \sqrt{|\Omega^i|\cdot n/kp} \le \sqrt{3/2}$.   (6.36)

Hence, for $\Omega\in\mathcal{O}$,

$\mathbb{P}[\|x^i\| \ge 2 \mid \Omega] \le \mathbb{P}\Big[f(v^i) \ge \mathbb{E}[f(v^i)\mid\Omega] + \big(2 - \sqrt{3/2}\big) \,\Big|\, \Omega\Big] \le \exp\Big(-\frac{kp}{4n}\Big)$,   (6.37)

where we have used that $(2-\sqrt{3/2})^2/2 \approx 0.3005 > 1/4$ to simplify the constant.
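The Gaussian concentration estimate (6.35) that drives this proof can also be observed numerically. The sketch below is illustrative only, with assumed parameter values: it samples the 1-Lipschitz function $f(v) = \|v\|$ restricted to a fixed support of size $3pk/2n$, for an iid $N(0,\sigma^2)$ vector with $\sigma^2 = n/(kp)$, and compares the empirical tail beyond the (empirical) mean with $\exp(-t^2/2\sigma^2)$.

```python
# Numerical check (illustrative only) of the concentration estimate (6.35) for the
# 1-Lipschitz function f(v) = || v restricted to a fixed support ||_2, where v has
# iid N(0, sigma^2) entries with sigma^2 = n/(kp).  Parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 50, 2000, 5
sigma = np.sqrt(n / (k * p))
m = int(3 * p * k / (2 * n))              # |Omega^i| at its threshold on the event O

trials = 20000
f = np.linalg.norm(sigma * rng.standard_normal((trials, m)), axis=1)
for t in (0.05, 0.1, 0.2):
    empirical = np.mean(f >= f.mean() + t)    # empirical mean stands in for E[f(v)]
    bound = np.exp(-t**2 / (2 * sigma**2))
    print(f"t = {t:.2f}:  empirical tail {empirical:.2e}  <=  bound {bound:.2e}")
```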
Now, fix $i' \ne i$. We apply exactly the same reasoning to

$\|x^iP_{\Omega^{i'}}\| = \|v^iP_{\Omega^i}P_{\Omega^{i'}}\| = \|v^iP_{\Omega^i\cap\Omega^{i'}}\| \doteq g(v^i)$.

Again, $g(\cdot)$ is a 1-Lipschitz function of $v^i$. Furthermore, for $\Omega\in\mathcal{O}$, $|\Omega^i\cap\Omega^{i'}| \le 3pk^2/2n^2$, and

$\mathbb{E}[g(v^i)\mid\Omega] \le \sqrt{3k/2n}$.   (6.38)

Again,

$\mathbb{P}\Big[g(v^i) \ge 2\sqrt{k/n} \,\Big|\, \Omega\Big] \le \mathbb{P}\Big[g(v^i) \ge \mathbb{E}[g(v^i)\mid\Omega] + \big(2-\sqrt{3/2}\big)\sqrt{k/n} \,\Big|\, \Omega\Big] \le \exp\Big(-\frac{k^2p}{4n^2}\Big)$.

A union bound over all $n$ choices of $i$ in (6.37) and all $n(n-1)$ ordered pairs $(i,i')$ completes the proof.

6.4 Proof of Lemma 6.3

Proof. We will show the calculations for $i = 1$. An identical argument works for $i = 2,\ldots,n$ as well. Since $A$ is incoherent, $A^*P_1A = A^*A - A^*A_1A_1^*A \approx I - e_1e_1^*$, and so we set

$\Delta \doteq A^*P_1A - (I - e_1e_1^*) \in \mathbb{R}^{n\times n}$.   (6.39)

We notice that since $\mu(A) \le 1$,

$\|\Delta\|_\infty \le \|A^*A - I\|_\infty + \|A^*A_1A_1^*A - e_1e_1^*\|_\infty \le 2\mu(A)$.   (6.40)

Write

$\|\Psi_1\| = \big\|P_\Omega\big(x^{1*}x^1\otimes(I - e_1e_1^* + \Delta)\big)P_\Omega\big\| \le \big\|P_\Omega\big(x^{1*}x^1\otimes(I-e_1e_1^*)\big)P_\Omega\big\| + \big\|P_\Omega\big(x^{1*}x^1\otimes\Delta\big)P_\Omega\big\|$.   (6.41)

We handle the two terms individually. For the first term,

$L \doteq P_\Omega\big(x^{1*}x^1\otimes(I-e_1e_1^*)\big)P_\Omega \in \mathbb{R}^{np\times np}$,   (6.42)

we let $\mathcal{L}: \mathbb{R}^{n\times p}\to\mathbb{R}^{n\times p}$ be the equivalent linear operator such that for all $Q\in\mathbb{R}^{n\times p}$,

$\mathrm{vec}[\mathcal{L}[Q]] = L\,\mathrm{vec}[Q]$.   (6.43)

The norm of $L$ as a linear operator from $\ell^2$ to $\ell^2$ is the same as the induced norm of $\mathcal{L}$:

$\|L\| = \|\mathcal{L}\| \doteq \sup_{Q\ne 0}\frac{\|\mathcal{L}[Q]\|_F}{\|Q\|_F}$.

From (6.42) and the relationship $\mathrm{vec}[PQR] = (R^*\otimes P)\,\mathrm{vec}[Q]$, the operator $\mathcal{L}[Q]$ is given by

$\mathcal{L}[Q] = P_\Omega\big[(I - e_1e_1^*)\,P_\Omega[Q]\,x^{1*}x^1\big]$.   (6.44)

For any $H\in\mathbb{R}^{n\times p}$, we can express the projection $P_\Omega$ of $H$ onto $\Omega$ via its action on the rows of $H$:

$P_\Omega[H] = \sum_{a=1}^n e_ae_a^*HP_{\Omega^a}$.   (6.45)

Applying this expression twice, (6.44) becomes

$\mathcal{L}[Q] = \sum_{a=1}^n e_ae_a^*\big[(I-e_1e_1^*)P_\Omega[Q]x^{1*}x^1\big]P_{\Omega^a} = \sum_{a=2}^n e_ae_a^*P_\Omega[Q]x^{1*}x^1P_{\Omega^a} = \sum_{a=2}^n e_ae_a^*\Big(\sum_{b=1}^n e_be_b^*QP_{\Omega^b}\Big)x^{1*}x^1P_{\Omega^a} = \sum_{a=2}^n e_ae_a^*QP_{\Omega^a}x^{1*}x^1P_{\Omega^a}$.   (6.46)

Since the $a$-th summand occupies only the $a$-th row, we can write $\|\mathcal{L}[Q]\|_F^2$ as the sum of the squared $\ell^2$ norms of the terms in the above expression:

$\|\mathcal{L}[Q]\|_F^2 = \sum_{a=2}^n \|e_a^*QP_{\Omega^a}x^{1*}x^1P_{\Omega^a}\|^2 \le \sum_{a=2}^n \|e_a^*Q\|^2\,\|x^1P_{\Omega^a}\|^4 \le \frac{16k^2}{n^2}\|Q\|_F^2$,   (6.47)

where above we have used that on $E_1$, $\|x^1P_{\Omega^a}\| \le 2\sqrt{k/n}$ for every $a\ne 1$. We conclude that

$\|L\| \le 4k/n$.   (6.48)

We next address the second term in (6.41). Define

$W \doteq P_\Omega\big(x^{1*}x^1\otimes\Delta\big)P_\Omega \in \mathbb{R}^{np\times np}$.   (6.49)

We need to bound the operator norm of $W$. As above, we associate a linear map $\mathcal{W}: \mathbb{R}^{n\times p}\to\mathbb{R}^{n\times p}$, given by

$\mathcal{W}[Q] = P_\Omega\big[\Delta\,P_\Omega[Q]\,x^{1*}x^1\big] = \sum_{a,b=1}^n e_ae_a^*\Delta e_be_b^*QP_{\Omega^b}x^{1*}x^1P_{\Omega^a}$.   (6.50)

In the above expression, terms for which $a,b\ne 1$ will be easily handled, since on $E_1$, $\|x^1P_{\Omega^a}\| \le 2\sqrt{k/n}$ for every $a\ne 1$. We therefore break the above summation into four terms:

$T_1 \doteq e_1e_1^*\Delta e_1e_1^*QP_{\Omega^1}x^{1*}x^1P_{\Omega^1}$,   (6.51)
$T_2 \doteq \sum_{b=2}^n e_1e_1^*\Delta e_be_b^*QP_{\Omega^b}x^{1*}x^1P_{\Omega^1}$,   (6.52)
$T_3 \doteq \sum_{a=2}^n e_ae_a^*\Delta e_1e_1^*QP_{\Omega^1}x^{1*}x^1P_{\Omega^a}$,   (6.53)

and

$T_4 \doteq \sum_{a,b=2}^n e_ae_a^*\Delta e_be_b^*QP_{\Omega^b}x^{1*}x^1P_{\Omega^a}$.   (6.54)

In terms of these four quantities,

$\mathcal{W}[Q] = T_1 + T_2 + T_3 + T_4$.   (6.55)

Now, $e_1^*\Delta e_1 = A_1^*A_1 - (A_1^*A_1)^2 = 0$, since $A_1^*A_1 = 1$. Plugging $e_1^*\Delta e_1 = 0$ into (6.51), we have $T_1 = 0$.
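The manipulations in this subsection lean repeatedly on the vectorization identity $\mathrm{vec}[PQR] = (R^*\otimes P)\,\mathrm{vec}[Q]$ used to pass between $L$ and $\mathcal{L}$ in (6.43)–(6.44). Below is a quick generic sanity check of that identity (with real matrices, so $R^* = R^\top$, and column-major vectorization); it is not specific to the operators of this section.

```python
# Generic sanity check (not specific to this section) of the vectorization identity
# vec[P Q R] = (R^T kron P) vec[Q], with column-major vec and real matrices.
import numpy as np

rng = np.random.default_rng(2)
P = rng.standard_normal((4, 3))
Q = rng.standard_normal((3, 5))
R = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order="F")       # stack columns
lhs = vec(P @ Q @ R)
rhs = np.kron(R.T, P) @ vec(Q)
print("max abs difference:", np.max(np.abs(lhs - rhs)))    # ~1e-15
```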
Below, we show that on $E_1$, the following bounds on the terms $T_2$, $T_3$, $T_4$ hold:

$\|T_2\|_F \le 8\mu(A)\sqrt{k}\,\|Q\|_F$,   (6.56)
$\|T_3\|_F \le 8\mu(A)\sqrt{k}\,\|Q\|_F$,   (6.57)
$\|T_4\|_F \le 8\mu(A)k\,\|Q\|_F$.   (6.58)

Hence, on $E_1$,

$\|\mathcal{W}[Q]\|_F \le \big(16\mu(A)\sqrt{k} + 8\mu(A)k\big)\|Q\|_F$,   (6.59)

and so $\|W\| \le 16\mu(A)\sqrt{k} + 8\mu(A)k \le 24k\mu(A)$. Combining this observation with (6.48) gives the desired result, (6.17). We just have to establish the three inequalities (6.56)–(6.58). Paragraphs (i)–(iii) below do this.

(i) Establishing (6.56). For the term $T_2$ defined in (6.52), notice

$\|T_2\|_F = \Big\|e_1^*\Delta\sum_b e_be_b^*QP_{\Omega^b}x^{1*}\,x^1P_{\Omega^1}\Big\|_2 \le \|e_1^*\Delta\|_2\,\Big\|\sum_b e_be_b^*QP_{\Omega^b}x^{1*}\Big\|_2\,\|x^1P_{\Omega^1}\|_2$   (6.60)
$\le \sqrt{n}\,\|e_1^*\Delta\|_\infty\times\Big\|\sum_b e_be_b^*QP_{\Omega^b}x^{1*}\Big\|_2\times 2$   (6.61)
$\le 4\sqrt{n}\,\mu(A)\,\Big\|\sum_b e_be_b^*QP_{\Omega^b}x^{1*}\Big\|_2$.   (6.62)

Above, we have used the Cauchy–Schwarz inequality in (6.60), the bound $\|x^1\| \le 2$ on $E_1$ in (6.61), and the bound $\|\Delta\|_\infty \le 2\mu(A)$ in (6.62). Now,

$\Big\|\sum_b e_be_b^*QP_{\Omega^b}x^{1*}\Big\|_2^2 = \sum_b\big(e_b^*QP_{\Omega^b}x^{1*}\big)^2 \le \sum_b \|e_b^*Q\|_2^2\,\|x^1P_{\Omega^b}\|_2^2 \le 4k\|Q\|_F^2/n$.   (6.63)

Combining (6.62) and (6.63) establishes (6.56).

(ii) Establishing (6.57). For the term $T_3$ defined in (6.53), we have

$\|T_3\|_F^2 = \sum_{a=2}^n (e_a^*\Delta e_1)^2\,\big\|e_1^*QP_{\Omega^1}x^{1*}x^1P_{\Omega^a}\big\|^2 \le 4\mu^2(A)\sum_{a=2}^n \|e_1^*Q\|^2\,\|x^1P_{\Omega^1}\|^2\,\|x^1P_{\Omega^a}\|^2$   (6.64)
$\le 4\mu^2(A)\times 4\times 4(k/n)\times(n-1)\,\|e_1^*Q\|^2$.   (6.65)

In (6.64) we have used the bound $\|\Delta\|_\infty \le 2\mu(A)$ and the Cauchy–Schwarz inequality; in (6.65) we have plugged in the bounds $\|x^1\| \le 2$ and $\|x^1P_{\Omega^a}\| \le 2\sqrt{k/n}$. Finally, conservatively bounding $\|e_1^*Q\|$ by $\|Q\|_F$ and taking the square root of both sides gives the desired result, (6.57).

(iii) Establishing (6.58). The final term, $T_4$, requires a bit more manipulation. Expressing $\|T_4\|_F^2$ as a sum of squared $\ell^2$ norms of the rows of $T_4$ gives

$\|T_4\|_F^2 = \sum_{a=2}^n \Big\|\sum_{b=2}^n e_a^*\Delta e_b\,e_b^*QP_{\Omega^b}x^{1*}\times x^1P_{\Omega^a}\Big\|^2 = \sum_{a=2}^n \|x^1P_{\Omega^a}\|^2\Big(\sum_{b=2}^n e_a^*\Delta e_b\times e_b^*QP_{\Omega^b}x^{1*}\Big)^2$   (6.66)
$\le 4(k/n)\times\sum_{a=2}^n\Big(e_a^*\Delta\sum_{b=2}^n e_be_b^*QP_{\Omega^b}x^{1*}\Big)^2$   (6.67)
$\le 4(k/n)\times\sum_{a=2}^n \|e_a^*\Delta\|_2^2\,\Big\|\sum_{b=2}^n e_be_b^*QP_{\Omega^b}x^{1*}\Big\|_2^2$   (6.68)
$\le 4(k/n)\times\|\Delta\|_F^2\times\Big\|\sum_{b=2}^n e_be_b^*QP_{\Omega^b}x^{1*}\Big\|_2^2$   (6.69)
$\le 4(k/n)\times n^2\|\Delta\|_\infty^2\times\sum_{b=2}^n\big(e_b^*QP_{\Omega^b}x^{1*}\big)^2$   (6.70)
$\le 16kn\,\mu^2(A)\times\sum_{b=2}^n\|e_b^*Q\|_2^2\,\|P_{\Omega^b}x^{1*}\|_2^2$   (6.71)
$\le 64k^2\mu^2(A)\times\sum_{b=2}^n\|e_b^*Q\|^2$.   (6.72)

Bounding the summation by $\|Q\|_F^2$ and taking square roots gives (6.58). This completes the proof of Lemma 6.3.
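Before moving on, the block-diagonal structure (6.6), which underlies the decomposition of $P_\Omega R P_\Omega$ into the summands $\Psi_i$ used throughout this section, is easy to verify numerically. The sketch below is illustrative only: it builds $C_A$ as the $mn\times n$ block-diagonal matrix whose $i$-th block is the column $A_i$ (as described in Footnote 6) and checks (6.6) directly for a small random dictionary.

```python
# Numerical check (illustrative only) of the block-diagonal identity (6.6):
#   (I kron A^T)(I - C_A C_A^T)(I kron A) = blockdiag( A^T (I - A_i A_i^T) A ),
# where C_A is the mn x n block-diagonal matrix whose i-th block is the column A_i.
import numpy as np

def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of 2-D arrays."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r, c = r + b.shape[0], c + b.shape[1]
    return out

rng = np.random.default_rng(3)
m, n = 6, 4
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)                       # unit-norm columns

C_A = block_diag([A[:, [i]] for i in range(n)])      # mn x n
lhs = np.kron(np.eye(n), A.T) @ (np.eye(m * n) - C_A @ C_A.T) @ np.kron(np.eye(n), A)
rhs = block_diag([A.T @ (np.eye(m) - np.outer(A[:, i], A[:, i])) @ A for i in range(n)])
print("max abs difference:", np.max(np.abs(lhs - rhs)))   # ~1e-15
```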
References

[AEB06] M. Aharon, M. Elad, and A. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
[ANR74] N. Ahmed, T. Natarajan, and K. Rao. Discrete cosine transform. IEEE Transactions on Computers, pages 90–93, 1974.
[AW02] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.
[BDE09] A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
[BE08] O. Bryt and M. Elad. Compression of facial images using the K-SVD algorithm. Journal of Visual Communication and Image Representation, 19(4):270–283, 2008.
[Can08] E. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
[CDDY06] E. Candès, L. Demanet, D. Donoho, and L. Ying. Fast discrete curvelet transforms. Multiscale Modeling and Simulation, 5:861–899, 2006.
[CLMW09] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Available online, 2009.
[CP09] E. Candès and Y. Plan. Near-ideal model selection by ℓ1 minimization. Annals of Statistics, 37:2145–2177, 2009.
[CP10] E. Candès and Y. Plan. A probabilistic RIPless theory of compressed sensing. Available online, 2010.
[CR08] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2008.
[CT05] E. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[CT09] E. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2009.
[DE03] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences of the United States of America, 100(5):2197–2202, March 2003.
[DT09] D. Donoho and J. Tanner. Counting faces of randomly projected polytopes when the projection radically lowers dimension. Journal of the American Mathematical Society, 22(1):1–53, 2009.
[EA06] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
[EAHH99] K. Engan, S. Aase, and J. Hakon-Husoy. Method of optimal directions for frame design. In ICASSP, volume 5, pages 2443–2446, 1999.
[Fuc04] J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50(6), 2004.
[GN03] R. Gribonval and M. Nielsen. Sparse decompositions in unions of bases. IEEE Transactions on Information Theory, 49:3320–3325, 2003.
[Gro09] D. Gross. Recovering low-rank matrices from a few coefficients in any basis. Available online, 2009.
[GS10] R. Gribonval and K. Schnass. Dictionary identification – sparse matrix factorization via ℓ1-minimization. IEEE Transactions on Information Theory, 56(7):3523–3539, 2010.
[GWW11] Q. Geng, H. Wang, and J. Wright. Algorithms for exact dictionary learning by ℓ1-minimization. Technical report, 2011.
[JO80] K. Jittorntrum and M. Osborne. Strong uniqueness and second order convergence in nonlinear discrete approximation. Numerische Mathematik, 34:439–455, 1980.
[Kah64] J. Kahane. Sur les sommes vectorielles Σ±u_n. Comptes Rendus Mathematique, 259:2577–2580, 1964.
[KDMR+03] K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15(20):349–396, 2003.
[Led01] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs 89. American Mathematical Society, Providence, RI, 2001.
[LO94] R. Latala and K. Oleszkiewicz. On the best constant in the Khintchine–Kahane inequality. Studia Mathematica, 109(1):101–104, 1994.
[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[MBP+08] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In Computer Vision and Pattern Recognition (CVPR), 2008.
[MBPS10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
[MG84] J. Morlet and A. Grossman. Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis, 15:723–736, 1984.
[MY09] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.
[NRWY09] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. In Advances in Neural Information Processing Systems (NIPS), 2009.
[OF96] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6538):607–609, 1996.
[RBE10] R. Rubinstein, A. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010.
[RS08] F. Rodriguez and G. Sapiro. Sparse representations for image classification: Learning discriminative and reconstructive non-parametric dictionaries. Available at http://www.ima.umn.edu/preprints/jun2008/2213.pdf, 2008.
[Tro08] J. Tropp. Norms of random submatrices and sparse approximation. Comptes Rendus Mathematique, 346:1271–1274, 2008.
[Tro10] J. Tropp. User-friendly tail bounds for matrix martingales. Available at http://arxiv.org/abs/1004.4389v4, 2010.
[Wal91] G. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34(4):30–44, 1991.
[WM10] J. Wright and Y. Ma. Dense error correction via ℓ1-minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.
[YWHM10] J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[ZY06] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
A Conditioning of the Linearized Subproblem

In Section 2, we sketched an argument for why the restricted isometry property would not be useful for analyzing the linearized subproblem in dictionary learning. We demonstrated a perturbation pair $(\Delta_A, \Delta_X)$ lying in the nullspace of the linear constraints, such that $\Delta_X$ has the same sparsity as $X$. The reader might wonder whether this analysis can be made rigorous, and justifiably so, since the classical RIP analysis pertains to an $\ell^1$-minimization problem in which all of the variables are weighted equally, whereas in the linearized subproblem, $\Delta_A$ is not penalized. In this section, we show a more precise sense in which the RIP is violated.

To see this, notice that whenever $X$ has full row rank $n$, we can write

$\Delta_A = -A\Delta_X X^*(XX^*)^{-1}$.   (A.1)

Eliminating $\Delta_A$ yields the equivalent problem

minimize $\|X + \Delta_X\|_1$ subject to $A\Delta_X(I - P_X) = 0$, $\langle A_i,\ A\Delta_X X^*(XX^*)^{-1}e_i\rangle = 0\ \ \forall i$,   (A.2)

where $P_X = X^*(XX^*)^{-1}X$. If we make the substitution $Z = X + \Delta_X$, we obtain an equivalent problem,

minimize $\|Z\|_1$ subject to $AZ(I-P_X) = AX(I-P_X)$, $\langle A_i,\ AZX^*(XX^*)^{-1}e_i\rangle = \langle A_i,\ AXX^*(XX^*)^{-1}e_i\rangle\ \ \forall i$.   (A.3)

This is an equality constrained $\ell^1$ norm minimization problem (a generic instance of this class of problems is sketched in the code below).
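For readers less familiar with this problem class: an equality-constrained $\ell^1$ program of the generic form $\min \|z\|_1$ subject to $Mz = b$ can be solved as a linear program. The sketch below is a minimal, generic instance only; it does not build the specific constraints in (A.3), and its dimensions and data are hypothetical.

```python
# A minimal, generic instance (illustrative only; not the constraints in (A.3)) of an
# equality-constrained l1 problem:  minimize ||z||_1  subject to  M z = b,
# solved as a linear program via the splitting z = u - w with u, w >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
m, n = 15, 30
M = rng.standard_normal((m, n))
z0 = np.zeros(n)
z0[rng.choice(n, size=3, replace=False)] = rng.standard_normal(3)
b = M @ z0                                     # right-hand side generated by a 3-sparse vector

c = np.ones(2 * n)                             # objective: sum(u) + sum(w) = ||z||_1
A_eq = np.hstack([M, -M])                      # M(u - w) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
z_hat = res.x[:n] - res.x[n:]
print("recovered support:", np.flatnonzero(np.abs(z_hat) > 1e-6))
print("true support:     ", np.flatnonzero(z0))
```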
We wish to know whether $Z = X$ is the unique optimal solution (corresponding to $\Delta_X = 0$ being uniquely optimal for the original problem). Let $\pi$ be any permutation of $[n]$ with no fixed point, and let $\Pi\in\mathbb{R}^{n\times n}$ be the corresponding permutation matrix. Then it is easy to verify that if we set $H = \Pi X\in\mathbb{R}^{n\times p}$, then

$AH(I-P_X) = 0$, and $|\langle A_i,\ AHX^*(XX^*)^{-1}e_i\rangle| \le \mu(A)$.

It is also obvious that $H$ has the same number of nonzeros, and in fact the same number of nonzeros in each column, as the desired solution $X$. Hence, if $\mu(A) = 0$ (i.e., $A$ is an orthonormal basis), $H$ lies in the nullspace of the linear constraints in (A.3), and so the RIP cannot hold for this problem. Hence, in what is arguably the best possible case for recovering $A$ and $X$, the linearized subproblem cannot have the RIP. Moreover, applying linear operations to the constraints in (A.3) cannot help, since $H$ lies strictly in the nullspace of these constraints. When $\mu(A)$ is nonzero but small, as in our above problem, $H$ still lies very near the nullspace, and the RIP does not hold with any useful constant for (A.3). Of course, strictly speaking, when $\mu(A) > 0$ this argument does not preclude the possibility that there is some linear transformation of the equality constraints in (A.3) that does have the RIP.

B Technical Tools

In this section, we quote two results used in our arguments. The first, which plays a key role, is the matrix Chernoff bound of Tropp [Tro10]. This convenient and powerful result builds on ideas introduced by Ahlswede and Winter [AW02].

Theorem B.1 (Matrix Chernoff Bound, [Tro10] Theorem 2.5). Let $M_1,\ldots,M_n$ be a finite sequence of independent random positive-semidefinite matrices of dimension $d$. Suppose that for each $M_i$, $\lambda_{\max}(M_i) \le B$ almost surely. Set $\mu_{\min} = \lambda_{\min}\big(\sum_i\mathbb{E}[M_i]\big)$ and $\mu_{\max} = \lambda_{\max}\big(\sum_i\mathbb{E}[M_i]\big)$. Then the following two bounds hold:

$\mathbb{P}\Big[\lambda_{\min}\Big(\sum_i M_i\Big) \le t\mu_{\min}\Big] \le d\exp\big(-(1-t)^2\mu_{\min}/2B\big), \quad \forall t\in[0,1)$,   (B.1)

$\mathbb{P}\Big[\lambda_{\max}\Big(\sum_i M_i\Big) \ge (1+t)\mu_{\max}\Big] \le d\Big[\frac{e^t}{(1+t)^{1+t}}\Big]^{\mu_{\max}/B}, \quad \forall t\ge 0$.   (B.2)

Two simplifications of the upper tail are useful:

$\mathbb{P}\Big[\Big\|\sum_i M_i\Big\| \ge (1+t)\mu_{\max}\Big] \le d\exp\big(-t^2\mu_{\max}/4B\big), \quad \forall t\in[0,1]$,   (B.3)

and

$\mathbb{P}\Big[\Big\|\sum_i M_i\Big\| \ge t\mu_{\max}\Big] \le d\,(e/t)^{t\mu_{\max}/B}, \quad \forall t > e$.   (B.4)

The second is given in [Tro10], while the first follows from (B.2) and the inequality $t - (1+t)\log(1+t) \le -t^2/4$, which by convexity (or calculus) can be shown to hold on $[0,1]$.
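The simplified upper tail (B.3) can be exercised on a toy ensemble. The simulation below is an illustration under assumed parameters, not an argument used in the paper: each summand is an independent rank-one PSD matrix $z_iz_i^\top$ with $\|z_i\| \le 1$ (so $B = 1$) and $\mathbb{E}[z_iz_i^\top] = I/(3d)$ known in closed form, and the empirical exceedance frequency is compared with $d\exp(-t^2\mu_{\max}/4B)$.

```python
# Illustrative simulation (assumed toy parameters) of the simplified upper tail (B.3).
# Each summand is a rank-one PSD matrix z z^T with ||z|| <= 1, so B = 1, and
# E[z z^T] = I/(3d), so mu_max = N/(3d) is known exactly.
import numpy as np

rng = np.random.default_rng(4)
d, N, trials, t = 8, 2000, 2000, 0.5
B = 1.0
mu_max = N / (3 * d)

def lambda_max_of_sum():
    Z = rng.uniform(-1.0, 1.0, size=(N, d)) / np.sqrt(d)   # entries in [-1/sqrt(d), 1/sqrt(d)]
    return np.linalg.eigvalsh(Z.T @ Z)[-1]                  # lambda_max of sum_i z_i z_i^T

empirical = np.mean([lambda_max_of_sum() >= (1 + t) * mu_max for _ in range(trials)])
bound = d * np.exp(-t**2 * mu_max / (4 * B))
print("empirical exceedance frequency:", empirical, "   Chernoff bound (B.3):", bound)
```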
The second result we quote here is the classical Kahane–Khintchine inequality, here with constant $1/\sqrt{2}$ found by Latala and Oleszkiewicz [LO94]:

Theorem B.2 (Kahane–Khintchine Inequality [Kah64], [LO94] Theorem 1). Let $\sigma_1,\ldots,\sigma_n$ be an iid sequence of Rademacher random variables (i.e., variables that take on $\pm 1$ with equal probability), and let $x_1,\ldots,x_n$ be a fixed sequence of vectors in a normed space $V$. Then

$\frac{1}{\sqrt{2}}\,\mathbb{E}\Big[\Big\|\sum_i\sigma_ix_i\Big\|_V^2\Big]^{1/2} \le \mathbb{E}\Big[\Big\|\sum_i\sigma_ix_i\Big\|_V\Big] \le \mathbb{E}\Big[\Big\|\sum_i\sigma_ix_i\Big\|_V^2\Big]^{1/2}$.   (B.5)

This result has the following useful consequence:

Corollary B.3. Let $M\in\mathbb{R}^{m\times n}$ be any fixed matrix, and $v\in\mathbb{R}^n$ be an iid $N(0,\sigma^2)$ vector. Then

$\frac{\sigma}{\sqrt{\pi}}\|M\|_F \le \mathbb{E}[\|Mv\|_2] \le \sigma\|M\|_F$.   (B.6)

C Consequences of Incoherence

In this section, we assemble several useful consequences of the assumption that $A$ has low mutual coherence. All of the following bounds are well known [Fuc04]; we record their statements and (very simple) proofs here for completeness. As above, let $A\in\mathbb{R}^{m\times n}$ be a matrix with unit norm columns and mutual coherence $\mu(A)$.

A bound on the mutual coherence immediately implies a bound on the norm of $A$. Set $\Delta = A^*A - I$. Then

$\|A\|^2 = \|A^*A\| = \|I + \Delta\| \le 1 + \|\Delta\| \le 1 + \|\Delta\|_F \le 1 + n\|\Delta\|_\infty = 1 + n\mu(A)$.   (C.1)

For submatrices of $A$, tighter bounds can be obtained in a similar manner: if $L\in[n]_k$, the same argument shows

$\|A_L\|^2 = \|A_L^*A_L\| \le 1 + k\mu(A)$.   (C.2)

Similarly, via eigenvalue perturbation bounds,

$\lambda_{\min}(A_L^*A_L) \ge 1 - k\mu(A)$.   (C.3)

In particular, if we assume that $k\mu(A) < 1/2$, we have

$\|(A_L^*A_L)^{-1}\| \le 2$.   (C.4)

We can obtain a tighter result by using the Neumann series representation of the inverse. Suppose that $k\mu(A) < 1/2$, write $A_L^*A_L = I + H$, and note that $\|H\|_F < k\mu(A)$. Then using

$(A_L^*A_L)^{-1} = \sum_{t=0}^\infty(-1)^tH^t$,   (C.5)

we have

$\|(A_L^*A_L)^{-1} - I\|_F = \Big\|\sum_{t=1}^\infty(-H)^t\Big\|_F \le \sum_{t=1}^\infty\|H\|_F^t \le \frac{k\mu(A)}{1-k\mu(A)} < 2k\mu(A)$.   (C.6)
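The bounds (C.2)–(C.3) are easy to check numerically for a random unit-column dictionary. In the sketch below (illustrative dimensions only), $\mu(A)$ is computed from the Gram matrix and the extreme eigenvalues of random $k$-column submatrices are compared against $1 \pm k\mu(A)$; note the bounds are only informative when $k\mu(A) < 1$.

```python
# Numerical check (illustrative dimensions) of the incoherence bounds (C.2)-(C.3):
# for a unit-column dictionary A with mutual coherence mu(A), every k-column
# submatrix A_L satisfies ||A_L||^2 <= 1 + k mu(A) and lambda_min(A_L^T A_L) >= 1 - k mu(A).
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 256, 512, 2
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)                    # unit-norm columns

mu = np.max(np.abs(A.T @ A - np.eye(n)))          # mutual coherence mu(A)

worst_hi, worst_lo = 0.0, np.inf
for _ in range(2000):
    L = rng.choice(n, size=k, replace=False)
    eigs = np.linalg.eigvalsh(A[:, L].T @ A[:, L])
    worst_hi, worst_lo = max(worst_hi, eigs[-1]), min(worst_lo, eigs[0])

print(f"k*mu(A) = {k * mu:.3f}")
print(f"max ||A_L||^2  = {worst_hi:.3f}  <=  1 + k*mu = {1 + k * mu:.3f}")
print(f"min eigenvalue = {worst_lo:.3f}  >=  1 - k*mu = {1 - k * mu:.3f}")
```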
D Local Properties

In this section, we prove Lemma 3.1, which reduces the question of whether $x$ is a local optimum of $f$ over $\mathcal{M}$ to the question of whether $\delta = 0$ minimizes $f(x+\delta)$ over $T_x\mathcal{M}$. The reasoning behind this lemma is simple. The function $f(A,X) = \|X\|_1$ is Lipschitz (its Lipschitz constant $L$ is at most $\sqrt{np}$). At the same time, $f$ is a polyhedral semi-norm. This means that the (unbounded) unit ball of $f$, $\mathcal{B} \doteq \{x \mid f(x)\le 1\}$, is a polyhedral set. Let $r = \min\{f(x+\delta) \mid \delta\in T_x\mathcal{M}\}$. The set of optimal $\delta$ corresponds precisely to the points $(-x + r\mathcal{B})\cap T_x\mathcal{M}$. In particular, if the optimizer is unique, then $r\mathcal{B}$ intersects $x + T_x\mathcal{M}$ at a single point. Since $\mathcal{B}$ is polyhedral, this intersection is "sharp": $f$ increases linearly as we move away from the optimal point. This property, sometimes referred to in the optimization literature as strong uniqueness, implies that higher order terms due to the curvature of $\mathcal{M}$ are negligible; if $\delta = 0$ is a unique optimum over the tangent space, $x$ is locally optimal over $\mathcal{M}$. We now formally prove Lemma 3.1.

Proof. For any $\delta\in T_x\mathcal{M}$, let $x_\delta: (-\varepsilon,\varepsilon)\to\mathcal{M}$ be the unique geodesic satisfying $x_\delta(0) = x$ and $\dot x_\delta(0) = \delta$. Then

$x_\delta(t) = x + t\delta + O(t^2)$.   (D.1)

We first prove that optimality over the tangent space is necessary. Indeed, suppose there exists $\delta\in T_x\mathcal{M}$ with $f(x+\delta) < f(x)$. Set $\tau = f(x) - f(x+\delta) > 0$. By convexity, for $\eta\in[0,1]$,

$f(x+\eta\delta) \le f(x) - \eta\tau$.   (D.2)

But,

$f(x_\delta(t)) = f\big(x+\eta\delta + (x_\delta(t) - (x+\eta\delta))\big) \le f(x+\eta\delta) + L\|x_\delta(t) - x - \eta\delta\|_2 \le f(x) - \eta\tau + L\|(t-\eta)\delta\|_2 + O(Lt^2)$.   (D.3)

When $t$ is sufficiently small and we take $\eta = t$, this value is strictly smaller than $f(x)$.

Conversely, suppose that $\delta = 0$ is the unique minimizer of $f(x+\delta)$ over $\delta\in T_x\mathcal{M}$. We will show that this minimizer is strongly unique (see e.g., [JO80]), i.e., $\exists\beta > 0$ such that

$f(x+\delta) \ge f(x) + \beta\|\delta\| \quad \forall\delta\in T_x\mathcal{M}$.   (D.4)

To see this, notice that if we write $x = (A,X)$ and $\delta = (\Delta_A, \Delta_X)$, then $f(x) = \|X\|_1$. Hence, if we set $r_0 = \min\{|X_{ij}| \mid X_{ij}\ne 0\} > 0$, whenever $\|\Delta_X\|_\infty < r_0$ and $t < 1$, we have

$\|X + t\Delta_X\|_1 = \|X\|_1 + t\langle\Sigma, \Delta_X\rangle + t\|P_{\Omega^c}\Delta_X\|_1 = \|X\|_1 + t\langle\Sigma + \mathrm{sign}(P_{\Omega^c}\Delta_X), \Delta_X\rangle$.

Set $\beta(\delta) \doteq \langle\Sigma + \mathrm{sign}(P_{\Omega^c}\Delta_X), \Delta_X\rangle$, and notice that $\beta$ is a continuous function of $\delta$. Let

$\beta_\star = \inf_{\delta\in T_x\mathcal{M},\ \|\delta\| = r_0/2}\beta(\delta) \ge 0$.   (D.5)

Then for all $\delta\in T_x\mathcal{M}$ with $\|\delta\| \le r_0/2$ we have

$f(x+\delta) \ge f(x) + (2\beta_\star/r_0)\|\delta\|$.   (D.6)

Moreover, by convexity of $f$, the same bound holds for all $\delta\in T_x\mathcal{M}$ (regardless of $\|\delta\|$). It remains to show that $\beta_\star$ is strictly larger than zero. On the contrary, suppose $\beta_\star = 0$. Since the infimum in (D.5) is taken over a compact set, it is achieved by some $\delta_\star\in T_x\mathcal{M}$. Hence, if $\beta_\star = 0$, $f(x+\delta_\star) = f(x)$, contradicting the uniqueness of the minimizer $\delta = 0$. This establishes (D.4). Hence, continuing forward, we have

$f(x_\delta(\eta)) \ge f(x) + \eta\beta\|\delta\| - O(\beta\eta^2)$.   (D.7)

For $\eta$ sufficiently small the right hand side is strictly greater than $f(x)$.

E A Decoupling Lemma

The following technical lemma concerns the expected norm of the restriction $P_\Omega M P_\Omega$ of a matrix $M$ with no diagonal elements to a random diagonal block. Its proof is an application of a well-known decoupling technique [LT91]. In particular, several steps are quite similar to manipulations in the proof of Proposition 2.1 of [Tro08], and are repeated here for completeness.

Lemma E.1. Fix $M\in\mathbb{R}^{n\times n}$ with all diagonal elements equal to zero. Let $\Omega\sim\mathrm{uni}([n]_k)$ be a uniform random subset of size $k$. Then the following estimate holds:

$\mathbb{E}[\|P_\Omega MP_\Omega\|_F] \le 16\sqrt{\frac{k}{n}}\,\mathbb{E}[\|MP_\Omega\|_F]$.   (E.1)

Proof. Let $\Lambda$ denote a diagonal matrix whose entries are iid Bernoulli random variables taking on value 1 with probability $k/n$. Let $k' \doteq \mathrm{tr}[\Lambda]$ denote the number of nonzeros in $\Lambda$; $k'$ is a binomial random variable. Then

$\mathbb{E}[\|\Lambda M\Lambda\|_F] = \sum_{s=0}^n\mathbb{P}[k'=s]\,\mathbb{E}[\|\Lambda M\Lambda\|_F\mid k'=s]$   (E.2)
$\ge \sum_{s=k}^n\mathbb{P}[k'=s]\,\mathbb{E}[\|\Lambda M\Lambda\|_F\mid k'=s]$.   (E.3)

Now, conditioned on $k'=s$, the nonzeros on the diagonal of $\Lambda$ are distributed according to a uniform distribution on $[n]_s$. Furthermore, whenever $\Omega\subset\mathrm{support}(\Lambda)$, $\|P_\Omega MP_\Omega\|_F \le \|\Lambda M\Lambda\|_F$, since $\Omega$ restricts to a smaller submatrix. Hence, $\forall s\ge k$,

$\mathbb{E}[\|\Lambda M\Lambda\|_F\mid k'=s] \ge \mathbb{E}[\|P_\Omega MP_\Omega\|_F]$.   (E.4)

Plugging into (E.3), and using that $k$ is a median of the binomial random variable $k'$, we have

$\mathbb{E}[\|\Lambda M\Lambda\|_F] \ge \sum_{s=k}^n\mathbb{P}[k'=s]\,\mathbb{E}[\|P_\Omega MP_\Omega\|_F]$   (E.5)
$= \mathbb{P}[k'\ge k]\,\mathbb{E}[\|P_\Omega MP_\Omega\|_F]$   (E.6)
$\ge \tfrac{1}{2}\,\mathbb{E}[\|P_\Omega MP_\Omega\|_F]$.   (E.7)

Hence, we have

$\mathbb{E}[\|P_\Omega MP_\Omega\|_F] \le 2\,\mathbb{E}[\|\Lambda M\Lambda\|_F]$.   (E.8)

Similar to [Tro08], for each $i,j$, let $M^{ij}\in\mathbb{R}^{n\times n}$ be a matrix whose $(i,j)$ entry is equal to the $(i,j)$ entry of $M$, and whose other entries are equal to zero. Write

$\mathbb{E}[\|\Lambda M\Lambda\|_F] = \mathbb{E}\Big[\Big\|\sum_{i>j}\lambda_i\lambda_j(M^{ij}+M^{ji})\Big\|_F\Big]$.   (E.9)

Introduce an independent sequence of Bernoulli random variables $\eta_1,\ldots,\eta_n$, each taking on value 1 with probability $1/2$, and write

$\mathbb{E}\Big[\Big\|\sum_{i>j}\lambda_i\lambda_j(M^{ij}+M^{ji})\Big\|_F\Big] = 2\,\mathbb{E}_\Lambda\Big[\Big\|\mathbb{E}_\eta\Big[\sum_{i>j}\big(\eta_i(1-\eta_j)+\eta_j(1-\eta_i)\big)\lambda_i\lambda_j(M^{ij}+M^{ji})\Big]\Big\|_F\Big]$
$\le 2\,\mathbb{E}_\Lambda\mathbb{E}_\eta\Big[\Big\|\sum_{i>j}\big(\eta_i(1-\eta_j)+\eta_j(1-\eta_i)\big)\lambda_i\lambda_j(M^{ij}+M^{ji})\Big\|_F\Big]$   (E.10)
$= 2\,\mathbb{E}_\eta\mathbb{E}_\Lambda\Big[\Big\|\sum_{i>j}\big(\eta_i(1-\eta_j)+\eta_j(1-\eta_i)\big)\lambda_i\lambda_j(M^{ij}+M^{ji})\Big\|_F\Big]$.   (E.11)

In (E.10), we used Jensen's inequality to pull the expectation out of the norm. Now, there must be at least one sequence $\eta_\star$ such that the right hand side of (E.11) is no smaller than its expectation over $\eta$. Let $T\subset[n]$ be the indices of the nonzeros in $\eta_\star$, and let $T^c$ be its complement.
Then combining (E.8) and (E.11), we have

$\mathbb{E}\big[\|P_\Omega MP_\Omega\|_F\big] \le 4\,\mathbb{E}_\Lambda\Big[\Big\|\sum_{i>j}\big(\eta_{\star i}(1-\eta_{\star j})+\eta_{\star j}(1-\eta_{\star i})\big)\lambda_i\lambda_j(M^{ij}+M^{ji})\Big\|_F\Big]$   (E.12)
$= 4\,\mathbb{E}_\Lambda\Big[\Big\|\sum_{i\in T,\,j\in T^c}\lambda_i\lambda_j(M^{ij}+M^{ji})\Big\|_F\Big]$   (E.13)
$\le 4\,\mathbb{E}_\Lambda\Big[\Big\|\sum_{i\in T,\,j\in T^c}\lambda_i\lambda_jM^{ij}\Big\|_F\Big] + 4\,\mathbb{E}_\Lambda\Big[\Big\|\sum_{i\in T,\,j\in T^c}\lambda_i\lambda_jM^{ji}\Big\|_F\Big]$.   (E.14)

Now, let $\Lambda'$ be an independent copy of $\Lambda$. Then the above is equal to

$4\,\mathbb{E}_{\Lambda,\Lambda'}\Big[\Big\|\sum_{i\in T,\,j\in T^c}\lambda'_i\lambda_jM^{ij}\Big\|_F\Big] + 4\,\mathbb{E}_{\Lambda,\Lambda'}\Big[\Big\|\sum_{i\in T,\,j\in T^c}\lambda_i\lambda'_jM^{ji}\Big\|_F\Big] \le 8\,\mathbb{E}_{\Lambda,\Lambda'}\Big[\Big\|\sum_{i,j=1}^n\lambda'_i\lambda_jM^{ij}\Big\|_F\Big] = 8\,\mathbb{E}_{\Lambda,\Lambda'}\big[\|\Lambda'M\Lambda\|_F\big]$   (E.15)
$\le 8\,\mathbb{E}_\Lambda\big[\mathbb{E}_{\Lambda'}[\|\Lambda'M\Lambda\|_F^2]\big]^{1/2}$   (E.16)
$= 8\sqrt{k/n}\,\mathbb{E}_\Lambda\big[\|M\Lambda\|_F\big]$.   (E.17)

Above, we have used the fact that the Frobenius norm does not increase when a matrix is restricted to a subset of its elements to move from a summation over $(i,j)\in T\times T^c$ to a summation over all pairs $(i,j)$. We now just have to move from the Bernoulli model back to the uniform model. For a fixed value $s$ of $k'$ (i.e., a fixed number of nonzeros in $\Lambda$), we can divide $\mathrm{support}(\Lambda)$ into $a = \lceil k'/k\rceil$ random subsets $S_1,\ldots,S_a$ of size at most $k$. Conditioned on $k'=s$, the marginal distribution of each $S_i$ is uniform on $[n]_{|S_i|}$, and hence

$\mathbb{E}_\Lambda\big[\|MP_{S_i}\|_F\mid k'=s\big] \le \begin{cases}\mathbb{E}_\Omega[\|MP_\Omega\|_F] & (i-1)k < s,\\ 0 & \text{else.}\end{cases}$   (E.18)

The condition on $i$ in the first line above simply implies that $S_i$ is nonempty. So,

$\mathbb{E}_\Lambda\big[\|M\Lambda\|_F\big] \le \mathbb{E}_\Lambda\Big[\sum_i\|MP_{S_i}\|_F\Big]$   (E.19)
$= \sum_{s=0}^n\sum_i\mathbb{E}_\Lambda\big[\|MP_{S_i}\|_F\mid k'=s\big]\,\mathbb{P}[k'=s]$   (E.20)
$\le \sum_{s=0}^n\Big(\Big\lfloor\frac{s}{k}\Big\rfloor + 1\Big)\,\mathbb{E}_\Omega[\|MP_\Omega\|_F]\,\mathbb{P}[k'=s]$   (E.21)
$= \mathbb{E}_\Omega[\|MP_\Omega\|_F]\times\mathbb{E}\big[\lfloor k'/k\rfloor + 1\big] \le 2\,\mathbb{E}_\Omega[\|MP_\Omega\|_F]$.   (E.22)

Combining (E.17) and (E.22) completes the proof.

F Proof of Lemma 5.2

We will establish Lemma 5.2 by applying the matrix Chernoff bound to the extreme eigenvalues of the sum of independent positive semidefinite matrices

$XX^* = \sum_{j=1}^p x_jx_j^*$.   (F.1)

A bit of care needs to be taken because the summands $x_jx_j^*$ have unbounded norm; we handle this by replacing them with a sequence of truncated terms $\bar x_j\bar x_j^*$ that are equivalent to $x_jx_j^*$ with very high probability.

Proof. Set

$\bar x_j = \begin{cases} x_j & \|x_j\| \le (1+\beta)\sqrt{n\log p/p},\\ 0 & \text{else,}\end{cases}$   (F.2)

where $\beta > 0$ is a constant to be chosen later. It is not difficult to show (see Footnote 8 below) that for each $j$,

$\mathbb{P}\Big[\|x_j\| > (1+\beta)\sqrt{\frac{n\log p}{p}}\Big] < p^{-\beta^2/2}$.   (F.4)

So, $\max_j\|x_j\|$ is bounded by $(1+\beta)\sqrt{n\log p/p}$ with probability at least $1 - p^{1-\beta^2/2}$. Hence, with at least this probability, $\bar x_j = x_j$ for all $j$, and $\sum_j x_jx_j^* = \sum_j\bar x_j\bar x_j^*$. Thanks to truncation, the following bound always holds:

$\|\bar x_j\bar x_j^*\| \le B \doteq (1+\beta)^2\,\frac{n\log p}{p}$.   (F.5)

Since $x_jx_j^* \succeq \bar x_j\bar x_j^*$ always, $\mathbb{E}[x_jx_j^*] \succeq \mathbb{E}[\bar x_j\bar x_j^*]$, and so

$\mu_{\max} \doteq \Big\|\mathbb{E}\Big[\sum_j\bar x_j\bar x_j^*\Big]\Big\| \le \Big\|\mathbb{E}\Big[\sum_j x_jx_j^*\Big]\Big\| = \|I\| = 1$.   (F.6)

Plugging into the bound (B.3), we have

$\mathbb{P}\Big[\lambda_{\max}\Big(\sum_j\bar x_j\bar x_j^*\Big) \ge 1+t\Big] \le n\exp\Big(-\frac{t^2\mu_{\max}\,p}{4(1+\beta)^2n\log p}\Big)$.   (F.7)

Notice that there is still a dependence on $\mu_{\max} \le 1$ in the exponent. This will be resolved by developing a lower bound on $\mu_{\min} \le \mu_{\max}$.

[Footnote 8] Conditioned on $\Omega_j$, $\|x_j\| = \|P_{\Omega_j}v_j\| \doteq f(v_j)$ is a 1-Lipschitz function of $v_j$. From Jensen's inequality, $\mathbb{E}[f(v_j)] \le \sqrt{\mathbb{E}[f^2(v_j)]} \le \sqrt{n/p}$. Hence, from Gaussian measure concentration,

$\mathbb{P}\big[f(v_j) \ge \mathbb{E}[f(v_j)\mid\Omega_j] + t \mid \Omega_j\big] \le \exp\Big(-\frac{t^2kp}{2n}\Big)$.   (F.3)
We set $t = \beta\sqrt{n\log p/p}$ and then use the fact that the bound holds for all $\Omega_j$ to remove the conditioning, giving (F.4).

The smallest eigenvalue requires a bit more work, since it decreases under truncation:

$\mu_{\min} \doteq \lambda_{\min}\Big(\mathbb{E}\Big[\sum_j\bar x_j\bar x_j^*\Big]\Big)$   (F.8)
$\ge \lambda_{\min}\Big(\mathbb{E}\Big[\sum_j x_jx_j^*\Big]\Big) - \Big\|\mathbb{E}\Big[\sum_j\big(\bar x_j\bar x_j^* - x_jx_j^*\big)\Big]\Big\|$   (F.9)
$\ge \lambda_{\min}\Big(\mathbb{E}\Big[\sum_j x_jx_j^*\Big]\Big) - \sum_j\Big\|\mathbb{E}\big[\bar x_j\bar x_j^* - x_jx_j^*\big]\Big\|$   (F.10)
$= \lambda_{\min}(I) - \sum_j\Big\|\mathbb{E}\big[x_jx_j^*\,\mathbf{1}_{\|x_j\|>\sqrt{B}}\big]\Big\|$   (F.11)
$\ge 1 - \sum_j\mathbb{E}\big[\|x_j\|_2^2\,\mathbf{1}_{\|x_j\|>\sqrt{B}}\big]$   (F.12)
$\ge 1 - \sum_j\sqrt{\mathbb{E}[\|x_j\|_2^4]}\,\sqrt{\mathbb{E}\big[(\mathbf{1}_{\|x_j\|>\sqrt{B}})^2\big]}$   (F.13)
$= 1 - p\,\sqrt{\mathbb{E}[\|x_1\|_2^4]}\,\sqrt{\mathbb{P}[\|x_1\|>\sqrt{B}]}$   (F.14)
$\ge 1 - p\times\frac{\sqrt{3}\,n}{p}\times p^{-\beta^2/4} = 1 - \sqrt{3}\,n\,p^{-\beta^2/4}$.   (F.15)

Above, in (F.10) we have used Jensen's inequality, while in (F.11) we have used the definition of $\bar x_j$. In (F.13) we have used the Cauchy–Schwarz inequality. The manipulations are completed using the fact that for $Z\sim N(0,\sigma^2)$, $\mathbb{E}[Z^4] = 3\sigma^4$, to give the following bound on $\mathbb{E}\|x_1\|^4$:

$\mathbb{E}\big[\|x_1\|_2^4\big] = \mathbb{E}\Big[\Big(\sum_{a\in\Omega_1}V_{a1}^2\Big)^2\Big] = \sum_{a,b\in\Omega_1}\mathbb{E}\big[V_{a1}^2V_{b1}^2\big] = k(k-1)\sigma^4 + k\,\mathbb{E}[V_{11}^4] = (k^2+2k)\sigma^4 \le 3k^2\sigma^4 = 3n^2/p^2$.

From the above, write $\mu_{\min} \ge 1 - g(p)$. Then Tropp's bound gives

$\mathbb{P}\Big[\lambda_{\min}\Big(\sum_j\bar x_j\bar x_j^*\Big) < 1-t\Big] \le n\exp\Big(-\frac{(t-g(p))^2}{2(\beta+1)^2}\cdot\frac{p}{n\log p}\Big)$.   (F.16)

For concreteness, we choose $\beta = 4$. Then provided $p > (Cn/t)^{1/4}$, we have $g(p) < t/2 < 1/2$, completing the proof.
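The conclusion of Lemma 5.2, that the spectrum of $XX^*$ concentrates around 1, can also be observed directly in simulation. The sketch below is illustrative only: it draws $X$ from the random coefficient model used in the paper (each column has a uniformly random support of size $k$ with iid $N(0, n/(kp))$ nonzero entries, so that $\mathbb{E}[XX^*] = I$) and reports the extreme eigenvalues of $XX^*$ as $p$ grows; the specific values of $n$, $k$, and $p$ are assumptions.

```python
# Illustrative simulation of the conclusion of Lemma 5.2: under the random coefficient
# model (each column of X has a uniformly random size-k support with iid N(0, n/(kp))
# nonzeros, so that E[X X^T] = I), the spectrum of X X^T concentrates around 1 as p grows.
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 5
for p in (1000, 10000, 50000):
    sigma = np.sqrt(n / (k * p))
    X = np.zeros((n, p))
    for j in range(p):
        X[rng.choice(n, size=k, replace=False), j] = sigma * rng.standard_normal(k)
    eigs = np.linalg.eigvalsh(X @ X.T)
    print(f"p = {p:6d}:   lambda_min = {eigs[0]:.3f},   lambda_max = {eigs[-1]:.3f}")
```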