Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors

Samuel B. Hopkins*   Tselil Schramm†   Jonathan Shi‡   David Steurer§

February 4, 2016

Abstract

We consider two problems that arise in machine learning applications: the problem of recovering a planted sparse vector in a random linear subspace and the problem of decomposing a random low-rank overcomplete 3-tensor. For both problems, the best known guarantees are based on the sum-of-squares method. We develop new algorithms inspired by analyses of the sum-of-squares method. Our algorithms achieve the same or similar guarantees as sum-of-squares for these problems but the running time is significantly faster.

For the planted sparse vector problem, we give an algorithm with running time nearly linear in the input size that approximately recovers a planted sparse vector with up to constant relative sparsity in a random subspace of R^n of dimension up to Ω̃(√n). These recovery guarantees match the best known ones of Barak, Kelner, and Steurer (STOC 2014) up to logarithmic factors. For tensor decomposition, we give an algorithm with running time close to linear in the input size (with exponent ≈ 1.086) that approximately recovers a component of a random 3-tensor over R^n of rank up to Ω̃(n^{4/3}). The best previous algorithm for this problem, due to Ge and Ma (RANDOM 2015), works up to rank Ω̃(n^{3/2}) but requires quasipolynomial time.

* Cornell University, samhop@cs.cornell.edu. Supported by an NSF Graduate Research Fellowship (NSF award no. 1144153) and by David Steurer's NSF CAREER award.
† UC Berkeley, tschramm@cs.berkeley.edu. Supported by an NSF Graduate Research Fellowship (NSF award no. 1106400).
‡ Cornell University, jshi@cs.cornell.edu. Supported by David Steurer's NSF CAREER award.
§ Cornell University, dsteurer@cs.cornell.edu. Supported by a Microsoft Research Fellowship, an Alfred P. Sloan Fellowship, an NSF CAREER award, and the Simons Collaboration for Algorithms and Geometry.

Contents

1 Introduction
  1.1 Planted sparse vector in random linear subspace
  1.2 Overcomplete tensor decomposition
  1.3 Tensor principal component analysis
  1.4 Related work
2 Techniques
  2.1 Planted sparse vector in random linear subspace
  2.2 Overcomplete tensor decomposition
  2.3 Tensor principal component analysis
3 Preliminaries
4 Planted sparse vector in random linear subspace
  4.1 Algorithm succeeds on good basis
  4.2 Closeness of input basis and good basis
5 Overcomplete tensor decomposition
  5.1 Proof of Theorem 5.3
  5.2 Spectral gap for diagonal terms: proof of Proposition 5.4
  5.3 Bound for cross terms: proof of Proposition 5.5
  5.4 Full algorithm and proof of Theorem 5.2
6 Tensor principal component analysis
  6.1 Spiked tensor model
  6.2 Linear-time algorithm
References
A Additional preliminaries
  A.1 Linear algebra
  A.2 Concentration tools
B Concentration bounds for planted sparse vector in random linear subspace
C Concentration bounds for overcomplete tensor decomposition
D Concentration bounds for tensor principal component analysis

1 Introduction

The sum-of-squares (SoS) method (also known as the Lasserre hierarchy) [Sho87, Par00, Nes00, Las01] is a powerful, semidefinite-programming-based meta-algorithm that applies to a wide range of optimization problems. The method has been studied extensively for moderate-size polynomial optimization problems that arise for example in control theory, and in the context of approximation algorithms for combinatorial optimization problems, especially constraint satisfaction and graph partitioning (see e.g. the survey [BS14]). For the latter, the SoS method captures and generalizes the best known approximation algorithms based on linear programming (LP), semidefinite programming (SDP), or spectral methods, and it is in many cases the most promising approach to obtain algorithms with better guarantees, especially in the context of Khot's Unique Games Conjecture [BBH+12].

A sequence of recent works applies the sum-of-squares method to basic problems that arise in unsupervised machine learning: in particular, recovering sparse vectors in linear subspaces and decomposing tensors in a robust way [BKS14, BKS15, HSS15, BM15, GM15]. For a wide range of parameters of these problems, SoS achieves significantly stronger guarantees than other methods, in polynomial or quasi-polynomial time.

Like other LP and SDP hierarchies, the sum-of-squares method comes with a degree parameter d ∈ N that allows for trading off running time and solution quality. This trade-off is appealing because for applications the additional utility of better solutions could vastly outweigh additional computational costs.
Unfortunately, the computational cost grows rather steeply in terms of the parameter d: the running time is n^{O(d)}, where n is the number of variables (usually comparable to the instance size). Further, even when the SDP has size polynomial in the input (when d = O(1)), solving the underlying semidefinite programs is prohibitively slow for large instances.

In this work, we introduce spectral algorithms for planted sparse vector, tensor decomposition, and tensor principal component analysis (PCA) that exploit the same high-degree information as the corresponding sum-of-squares algorithms without relying on semidefinite programming, and achieve the same (or close to the same) guarantees. The resulting algorithms are quite simple (a couple of lines of Matlab code) and have considerably faster running times: quasi-linear or close to linear in the input size.

A surprising implication of our work is that for some problems, spectral algorithms can exploit information from larger values of the parameter d without spending time n^{O(d)}. For example, one of our algorithms runs in nearly-linear time in the input size, even though it uses properties that the sum-of-squares method can only use for degree parameter d ≥ 4. (In particular, the guarantees that the algorithm achieves are strictly stronger than the guarantees that SoS achieves for values of d < 4.)

The initial successes of SoS in the machine learning setting gave hope that techniques developed in the theory of approximation algorithms, specifically the techniques of hierarchies of convex relaxations and rounding convex relaxations, could broadly impact the practice of machine learning. This hope was dampened by the fact that in general, algorithms that rely on solving large semidefinite programs are too slow to be practical for the large-scale problems that arise in machine learning.
Our work brings this hope back into focus by demonstrating for the first time that with some care SoS algorithms can be made practical for large-scale problems. In the following subsections we describe each of the problems that we consider, the prior best-known guarantee via the SoS hierarchy, and our results.

1.1 Planted sparse vector in random linear subspace

The problem of finding a sparse vector planted in a random linear subspace was introduced by Spielman, Wang, and Wright as a way of learning sparse dictionaries [SWW12]. Subsequent works have found further applications and begun studying the problem in its own right [DH14, BKS14, QSW14]. In this problem, we are given a basis for a d-dimensional linear subspace of R^n that is random except for one planted sparse direction, and the goal is to recover this sparse direction. The computational challenge is to solve this problem even when the planted vector is only mildly sparse (a constant fraction of non-zero coordinates) and the subspace dimension is large compared to the ambient dimension (d ≥ n^{Ω(1)}).

Several kinds of algorithms have been proposed for this problem, based on linear programming (LP), basic semidefinite programming (SDP), sum-of-squares, and non-convex gradient descent (alternating directions method). An inherent limitation of simpler convex methods (LP and basic SDP) [SWW12, dGJL04] is that they require the relative sparsity of the planted vector to be polynomial in the subspace dimension (less than n/√d non-zero coordinates). Sum-of-squares and non-convex methods do not share this limitation. They can recover planted vectors with constant relative sparsity even if the subspace has polynomial dimension (up to dimension O(n^{1/2}) for sum-of-squares [BKS14] and up to O(n^{1/4}) for non-convex methods [QSW14]).
We state the problem formally:

Problem 1.1 (Planted sparse vector problem with ambient dimension n ∈ N, subspace dimension d ≤ n, sparsity ε > 0, and accuracy η > 0). Given an arbitrary orthogonal basis of a subspace spanned by vectors v₀, v₁, ..., v_{d−1} ∈ R^n, where v₀ is a vector with at most εn non-zero entries and v₁, ..., v_{d−1} are vectors sampled independently at random from the standard Gaussian distribution on R^n, output a unit vector v ∈ R^n that has correlation ⟨v, v₀⟩² ≥ 1 − η with the sparse vector v₀.

Our results. Our algorithm runs in nearly linear time in the input size, and matches the best-known guarantees up to a polylogarithmic factor in the subspace dimension [BKS14].

Theorem 1.2 (Planted sparse vector in nearly-linear time). There exists an algorithm that, for every sparsity ε > 0, ambient dimension n, and subspace dimension d with d ≤ √n/(log n)^{O(1)}, solves the planted sparse vector problem with high probability for some accuracy η ≤ O(ε^{1/4}) + o_{n→∞}(1). The running time of the algorithm is Õ(nd).

We give a technical overview of the proof in Section 2, and a full proof in Section 4.

Previous work also showed how to recover the planted sparse vector exactly. The task of going from an approximate solution to an exact one is a special case of standard compressed sensing (see e.g. [BKS14]).

Table 1: Comparison of algorithms for the planted sparse vector problem with ambient dimension n, subspace dimension d, and relative sparsity ε.
Reference                        Technique                  Runtime       Largest d     Largest ε
Demanet, Hand [DH14]             linear programming         poly          any           Ω(1/√d)
Barak, Kelner, Steurer [BKS14]   SoS, general SDP           poly          Ω(√n)         Ω(1)
Qu, Sun, Wright [QSW14]          alternating minimization   Õ(n² d⁵)      Ω(n^{1/4})    Ω(1)
this work                        SoS, partial traces        Õ(nd)         Ω̃(√n)         Ω(1)

1.2 Overcomplete tensor decomposition

Tensors naturally represent multilinear relationships in data. Algorithms for tensor decompositions have long been studied as a tool for data analysis across a wide range of disciplines (see the early work of Harshman [Har70] and the survey [KB09]). While the problem is NP-hard in the worst case [Hås90, HL13], algorithms for special cases of tensor decomposition have recently led to new provable algorithmic results for several unsupervised learning problems [AGH+14, BCMV14, GVX14, AGHK14], including independent component analysis, learning mixtures of Gaussians [GHK15], latent Dirichlet topic modeling [AFH+15], and dictionary learning [BKS15]. Some previous learning algorithms can also be reinterpreted in terms of tensor decomposition [Cha96, MR06, NR09].

A key algorithmic challenge for tensor decompositions is overcompleteness, when the number of components is larger than their dimension (i.e., the components are linearly dependent). Most algorithms that work in this regime require tensors of order 4 or higher [LCC07, BCMV14]. For example, the FOOBI algorithm of [LCC07] can recover up to Ω(d²) components given an order-4 tensor in dimension d, under mild algebraic independence assumptions for the components that are satisfied with high probability by random components. For overcomplete 3-tensors, which arise in many applications of tensor decompositions, such a result remains elusive.
Researchers have therefore turned to investigate average-case versions of the problem, where the components of the overcomplete 3-tensor are random: given a 3-tensor T ∈ R^{d³} of the form

  T = Σ_{i=1}^n a_i ⊗ a_i ⊗ a_i,

where a₁, ..., a_n are random unit or Gaussian vectors, the goal is to approximately recover the components a₁, ..., a_n. Algorithms based on tensor power iteration, a gradient-descent approach for tensor decomposition, solve this problem in polynomial time when n ≤ C·d for any constant C ≥ 1 (the running time is exponential in C) [AGJ15]. Tensor power iteration also admits local convergence analyses for up to n ≤ Ω̃(d^{1.5}) components [AGJ15, AGJ14]. Unfortunately, these analyses do not give polynomial-time algorithms, because it is not known how to efficiently obtain the kind of initializations assumed by the analyses.

Recently, Ge and Ma [GM15] were able to show that a tensor-decomposition algorithm [BKS15] based on sum-of-squares solves the above problem for n ≤ Ω̃(d^{1.5}) in quasi-polynomial time n^{O(log n)}. The key ingredient of their elegant analysis is a subtle spectral concentration bound for a particular degree-4 matrix-valued polynomial associated with the decomposition problem of random overcomplete 3-tensors.

Table 2: Comparison of decomposition algorithms for overcomplete 3-tensors with n components in dimension d.

Reference                      Technique                Runtime         Largest n      Components
Anandkumar et al. [AGJ15]^a    tensor power iteration   poly            C·d            incoherent
Ge, Ma [GM15]                  SoS, general SDP         n^{O(log n)}    Ω̃(d^{3/2})     N(0, (1/d)·Id_d)
this work^b                    SoS, partial traces      Õ(nd^{1+ω})     Ω̃(d^{4/3})     N(0, (1/d)·Id_d)

^a The analysis shows that for every constant C ≥ 1, the running time is polynomial for n ≤ C·d components. The analysis assumes that the components also satisfy other random-like properties besides incoherence.
^b Here, ω ≤ 2.373 is the constant so that d×d matrices can be multiplied in O(d^ω) arithmetic operations.

We state the problem formally:

Problem 1.3 (Random tensor decomposition with dimension d, rank n, and accuracy η). Let a₁, ..., a_n ∈ R^d be independently sampled vectors from the Gaussian distribution N(0, (1/d)·Id_d), and let T ∈ (R^d)^{⊗3} be the 3-tensor T = Σ_{i=1}^n a_i^{⊗3}.

Single component: Given T sampled as above, find a unit vector b that has correlation max_i ⟨a_i, b⟩ ≥ 1 − η with one of the vectors a_i.

All components: Given T sampled as above, find a set of unit vectors {b₁, ..., b_n} such that ⟨a_i, b_i⟩ ≥ 1 − η for every i ∈ [n].

Our results. We give the first polynomial-time algorithm for decomposing random overcomplete 3-tensors with up to ω(d) components. Our algorithm works as long as the number of components satisfies n ≤ Ω̃(d^{4/3}), which comes close to the bound Ω̃(d^{1.5}) achieved by the aforementioned quasi-polynomial algorithm of Ge and Ma. For the single-component version of the problem, our algorithm runs in time close to linear in the input size.

Theorem 1.4 (Fast random tensor decomposition). There exist randomized algorithms that, for every dimension d and rank n with d ≤ n ≤ d^{4/3}/(log n)^{O(1)}, solve the random tensor decomposition problem with probability 1 − o(1) for some accuracy η ≤ Õ(n³/d⁴)^{1/2}. The running time for the single-component version of the problem is Õ(min{d^{1+ω}, d^{3.257}}), where d^ω is the time to multiply two d-by-d matrices. The running time for the all-components version of the problem is Õ(n · min{d^{1+ω}, d^{3.257}}).

We give a technical overview of the proof in Section 2, and a full proof in Section 5.
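For concreteness, a Problem 1.3 input is easy to generate. The sketch below is illustrative only (the sizes d and n are hypothetical small values chosen for a quick demonstration): it samples Gaussian components a_i ~ N(0, (1/d)·Id_d) and forms T = Σ_i a_i^{⊗3}.

```python
import numpy as np

def random_overcomplete_3tensor(d, n, seed=0):
    """Sample a Problem 1.3 instance: T = sum_i a_i^{(x)3} with a_i ~ N(0, Id/d)."""
    rng = np.random.default_rng(seed)
    # Components a_1, ..., a_n are iid N(0, (1/d) Id_d), so E ||a_i||^2 = 1.
    A = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))
    # T[j, k, l] = sum_i a_i[j] * a_i[k] * a_i[l].
    T = np.einsum("ij,ik,il->jkl", A, A, A)
    return T, A

d, n = 25, 40  # hypothetical small sizes; the regime of interest is d <= n <= d^{4/3}
T, A = random_overcomplete_3tensor(d, n)
print(T.shape, np.linalg.norm(A, axis=1).mean().round(2))
```

Since E ‖a_i‖² = 1, the components are nearly unit vectors, and T is symmetric under permutations of its three modes, which is the structure the decomposition algorithms exploit.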
We remark that the above algorithm only requires access to the input tensor with some fixed inverse polynomial accuracy, because each of its four steps amplifies errors by at most a polynomial factor (see Algorithm 5.17). In this sense, the algorithm is robust.

1.3 Tensor principal component analysis

The problem of tensor principal component analysis is similar to the tensor decomposition problem. However, here the focus is not on the number of components in the tensor, but on recovery in the presence of a large amount of random noise. We are given as input a tensor τ·v^{⊗3} + A, where v ∈ R^n is a unit vector and the entries of A are chosen iid from N(0, 1). This spiked tensor model was introduced by Montanari and Richard [RM14], who also obtained the first algorithms to solve the model with provable statistical guarantees. The spiked tensor model was subsequently addressed by a subset of the present authors [HSS15], who applied the SoS approach to improve the signal-to-noise ratio required for recovery from odd-order tensors.

Table 3: Comparison of algorithms for principal component analysis of 3-tensors in dimension d and with signal-to-noise ratio τ.

Reference                       Technique             Runtime   Smallest τ
Montanari, Richard [RM14]       spectral              Õ(d³)     n
Hopkins, Shi, Steurer [HSS15]   SoS, spectral         Õ(d³)     Ω(n^{3/4})
this work                       SoS, partial traces   O(d³)     Ω̃(n^{3/4})

We state the problem formally:

Problem 1.5 (Tensor principal component analysis with signal-to-noise ratio τ and accuracy η). Let T ∈ (R^d)^{⊗3} be a tensor so that T = τ·v^{⊗3} + A, where A is a tensor with independent standard Gaussian entries and v ∈ R^d is a unit vector. Given T, recover a unit vector v′ ∈ R^d such that ⟨v′, v⟩ ≥ 1 − η.

Our results.
For this problem, our improvements over the previous results are more modest: we achieve signal-to-noise guarantees matching [HSS15], but with an algorithm that runs in linear time rather than near-linear time (time O(d³) rather than O(d³ polylog d), for an input of size d³).

Theorem 1.6 (Tensor principal component analysis in linear time). There is an algorithm which solves the tensor principal component analysis problem with accuracy η > 0 whenever the signal-to-noise ratio satisfies τ ≥ O(n^{3/4}/η · log^{1/2} n). Furthermore, the algorithm runs in time O(d³).

Though for tensor PCA our improvement over previous work is modest, we include the results here as this problem is a pedagogically poignant illustration of our techniques. We give a technical overview of the proof in Section 2, and a full proof in Section 6.

1.4 Related work

Foremost, this work builds upon the SoS algorithms of [BKS14, BKS15, GM15, HSS15]. In each of these previous works, a machine learning decision problem is solved using an SDP relaxation for SoS. In these works, the SDP value is large in the yes case and small in the no case, and the SDP value can be bounded using the spectrum of a specific matrix. This was implicit in [BKS14, BKS15], and in [HSS15] it was used to obtain a fast algorithm as well. In our work, we design spectral algorithms which use smaller matrices, inspired by the SoS certificates in previous works, to solve these machine-learning problems much faster, with almost matching guarantees.

A key idea in our work is that given a large matrix with information encoded in the matrix's spectral gap, one can often efficiently "compress" the matrix to a much smaller one without losing that information. This is particularly true for problems with planted solutions.
In this way, we are able to improve running time by replacing an n^{O(d)}-sized SDP with an eigenvector computation for an n^k × n^k matrix, for some k < d.

The idea of speeding up LP and SDP hierarchies for specific problems has been investigated in a series of previous works [dlVK07, BRS11, GS12], which show that with respect to local analyses of the sum-of-squares algorithm it is sometimes possible to improve the running time from n^{O(d)} to 2^{O(d)}·n^{O(1)}. However, the scopes and strategies of these works are completely different from ours. First, the notion of local analysis from these works does not apply to the problems considered here. Second, these works employ the ellipsoid method with a separation oracle inspired by rounding algorithms, whereas we reduce the problem to an ordinary eigenvector computation.

It would also be interesting to see if our methods can be used to speed up some of the other recent successful applications of SoS to machine-learning-type problems, such as [BM15], or the application of [BKS14] to tensor decomposition with components that are well-separated (rather than random).

Finally, we would be remiss not to mention that SoS lower bounds exist for several of these problems, specifically for tensor principal component analysis, tensor prediction, and sparse PCA [HSS15, BM15, MW15]. The lower bounds in the SoS framework are a good indication that we cannot expect spectral algorithms achieving better guarantees.

2 Techniques

Sum-of-squares method (for polynomial optimization over the sphere). The problems we consider are connected to optimization problems of the following form: given a homogeneous n-variate real polynomial f of constant degree, find a unit vector x ∈ R^n so as to maximize f(x).
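As a toy illustration of such an optimization problem (hypothetical sizes; the quartic f(x) = ‖Ax‖₄⁴ is the polynomial that reappears in the planted sparse vector discussion), the sketch below evaluates f on random unit vectors. Sampling like this only certifies lower bounds on the maximum over the sphere; efficiently certifiable upper bounds are what the spectral and sum-of-squares machinery provides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10  # hypothetical sizes
# Columns of A: an orthonormal basis of a random d-dimensional subspace of R^n.
A = np.linalg.qr(rng.normal(size=(n, d)))[0]

def f(x):
    """The homogeneous degree-4 polynomial f(x) = ||Ax||_4^4."""
    return np.sum((A @ x) ** 4)

# Each evaluation at a unit vector lower-bounds max_{||x||=1} f(x).
best = 0.0
for _ in range(1000):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    best = max(best, f(x))
print("best sampled value:", best)
```

Note that f is homogeneous of degree 4, so f(cx) = c⁴·f(x); restricting to the unit sphere is what makes the maximization meaningful.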
The sum-of-squares method allows us to efficiently compute upper bounds on the maximum value of such a polynomial f over the unit sphere. For the case that k = deg(f) is even, the most basic upper bound of this kind is the largest eigenvalue of a matrix representation of f. A matrix representation of a polynomial f is a symmetric matrix M with rows and columns indexed by monomials of degree k/2 so that f(x) can be written as the quadratic form f(x) = ⟨x^{⊗k/2}, M x^{⊗k/2}⟩, where x^{⊗k/2} is the k/2-fold tensor power of x. The largest eigenvalue of a matrix representation M is an upper bound on the value of f(x) over unit x ∈ R^n because

  f(x) = ⟨x^{⊗k/2}, M x^{⊗k/2}⟩ ≤ λ_max(M) · ‖x^{⊗k/2}‖₂² = λ_max(M).

The sum-of-squares method improves on this basic spectral bound systematically by associating a large family of polynomials (potentially of degree higher than deg(f)) with the input polynomial f and computing the best possible spectral bound within this family of polynomials. Concretely, the sum-of-squares method with degree parameter d applied to a polynomial f with deg(f) ≤ d considers the affine subspace of polynomials {f + (1 − ‖x‖₂²)·g | deg(g) ≤ d − 2} ⊆ R[x] and minimizes λ_max(M) among all matrix representations¹ M of polynomials in this space.² The problem of searching through this affine linear space of polynomials and their matrix representations and finding the one of smallest maximum eigenvalue can be solved using semidefinite programming.

¹ Earlier we defined matrix representations only for homogeneous polynomials of even degree. In general, a matrix representation of a polynomial g is a symmetric matrix M with rows and columns indexed by monomials of degree at most ℓ = deg(g)/2 such that g(x) = ⟨x^{⊗≤ℓ}, M x^{⊗≤ℓ}⟩ (as a polynomial identity), where x^{⊗≤ℓ} = (1/√(ℓ+1))·(x^{⊗0}, x^{⊗1}, ..., x^{⊗ℓ}) is the vector of all monomials of degree at most ℓ. Note that ‖x^{⊗≤ℓ}‖ = 1 for all x with ‖x‖ = 1.

² The name of the method stems from the fact that this last step is equivalent to finding the minimum number λ such that the space contains a polynomial of the form λ − (g₁² + ··· + g_t²), where g₁, ..., g_t are polynomials of degree at most d/2.

Our approach for faster algorithms based on SoS algorithms is to construct specific matrices (polynomials) in this affine linear space, then compute their top eigenvectors. By designing our matrices carefully, we ensure that our algorithms have access to the same higher-degree information that the sum-of-squares algorithm can access, and this information affords an advantage over the basic spectral methods for these problems. At the same time, our algorithms avoid searching for the best polynomial and matrix representation, which gives us faster running times since we avoid semidefinite programming. This approach is well suited to average-case problems, where we avoid the problem of adversarial choice of input; in particular it is applicable to machine learning problems where noise and inputs are assumed to be random.

Compressing matrices with partial traces. A serious limitation of the above approach is that the representation of a degree-d polynomial requires size roughly n^d. Hence, even avoiding the use of semidefinite programming, improving upon running time O(n^d) requires additional ideas. In each of the problems that we consider, we have a large matrix (suggested by a SoS algorithm) with a "signal" planted in some amount of "noise". We show that in some situations, this large matrix can be compressed significantly without loss in the signal by applying partial trace operations.
In these situations, the partial trace yields a smaller matrix with the same signal-to-noise ratio as the large matrix suggested by the SoS algorithm, even in situations when lower-degree sum-of-squares approaches are known to fail.

The partial trace Tr_{R^d} : R^{d²×d²} → R^{d×d} is the linear operator that satisfies Tr_{R^d}(A ⊗ B) = (Tr A)·B for all A, B ∈ R^{d×d}. To see how the partial trace can be used to compress large matrices to smaller ones with little loss, consider the following problem: given a matrix M ∈ R^{d²×d²} of the form

  M = τ·(v ⊗ v)(v ⊗ v)^⊤ + A ⊗ B

for some unit vector v ∈ R^d and matrices A, B ∈ R^{d×d}, we wish to recover the vector v. (This is a simplified version of the planted problems that we consider in this paper, where τ·(v ⊗ v)(v ⊗ v)^⊤ is the signal and A ⊗ B plays the role of noise.) It is straightforward to see that the matrix A ⊗ B has spectral norm ‖A ⊗ B‖ = ‖A‖·‖B‖, and so when τ ≫ ‖A‖·‖B‖, the matrix M has a noticeable spectral gap, and the top eigenvector of M will be close to v ⊗ v. If |Tr A| ≈ ‖A‖, the matrix Tr_{R^d} M = τ·vv^⊤ + Tr(A)·B has a matching spectral gap, and we can still recover v, but now we only need to compute the top eigenvector of a d×d (as opposed to d²×d²) matrix.³ If A is a Wigner matrix (e.g. a symmetric matrix with iid ±1 entries), then both Tr(A), ‖A‖ ≈ √d, and the above condition is indeed met.

In our average-case / machine learning settings, the "noise" component is not as simple as A ⊗ B with A a Wigner matrix. Nonetheless, we are able to ensure that the noise displays a similar behavior under partial trace operations. In some cases, this requires additional algorithmic steps, such as random projection in the case of tensor decomposition, or centering the matrix eigenvalue distribution in the case of the planted sparse vector.
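The toy recovery problem above can be checked numerically. The sketch below is illustrative (the dimension and the margin τ = 10·‖A‖·‖B‖ are arbitrary choices): it plants the signal in Wigner-tensor-Wigner noise and recovers v from the top eigenvector of the d×d partial trace instead of the d²×d² matrix M itself.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 40  # hypothetical size

def wigner(d):
    """Symmetric Gaussian Wigner matrix: Tr ~ sqrt(d), spectral norm ~ 2*sqrt(d)."""
    G = rng.normal(size=(d, d))
    return (G + G.T) / np.sqrt(2)

v = rng.normal(size=d)
v /= np.linalg.norm(v)
A, B = wigner(d), wigner(d)

# Signal tau * (v (x) v)(v (x) v)^T plus noise A (x) B, with tau >> ||A|| ||B||.
tau = 10 * np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
vv = np.kron(v, v)
M = tau * np.outer(vv, vv) + np.kron(A, B)

# Partial trace over the first tensor factor: Tr_{R^d}(A (x) B) = Tr(A) * B,
# and Tr_{R^d}((v (x) v)(v (x) v)^T) = v v^T since ||v|| = 1.
N = np.einsum("ijil->jl", M.reshape(d, d, d, d))

# Top eigenvector of the small d x d matrix recovers v (up to sign).
_, V = np.linalg.eigh((N + N.T) / 2)
v_hat = V[:, -1]
print(abs(v_hat @ v))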
It is an interesting question whether there are general theorems describing the behavior of spectral norms under partial trace operations. In the current work, we compute the partial traces explicitly and estimate their norms directly. Indeed, our analyses boil down to concentration bounds for special matrix polynomials. A general theory for the concentration of matrix polynomials is a notorious open problem (see [MW13]).

³ In some of our applications, the matrix M is only represented implicitly and has size superlinear in the size of the input, but nevertheless we can compute the top eigenvector of the partial trace Tr_{R^d} M in nearly-linear time.

Partial trace operations have previously been applied for rounding SoS relaxations. Specifically, the operation of reweighing and conditioning, used in rounding algorithms for sum-of-squares such as [BRS11, RT12, BKS14, BKS15, LR15], corresponds to applying a partial trace operation to the moments matrix returned by the sum-of-squares relaxation.

We now give a technical overview of our algorithmic approach for each problem, and some broad strokes of the analysis for each case. Our most substantial improvements in runtime are for the planted sparse vector and overcomplete tensor decomposition problems (Section 2.1 and Section 2.2, respectively). Our algorithm for tensor PCA is the simplest application of our techniques, and it may be instructive to skip ahead and read about tensor PCA first (Section 2.3).

2.1 Planted sparse vector in random linear subspace

Recall that in this problem we are given a linear subspace U (represented by some basis) that is spanned by a k-sparse unit vector v₀ ∈ R^n and random unit vectors v₁, ..., v_{d−1} ∈ R^n. The goal is to recover the vector v₀ approximately.

Background and SoS analysis. Let A ∈ R^{n×d} be a matrix whose columns form an orthonormal basis for U.
Our starting point is the polynomial f(x) = ‖Ax‖₄⁴ = Σ_{i=1}^n (Ax)_i⁴. Previous work showed that for d ≪ √n the maximizer of this polynomial over the sphere corresponds to a vector close to v₀, and that degree-4 sum-of-squares is able to capture this fact [BBH+12, BKS14]. Indeed, typical random vectors v in R^n satisfy ‖v‖₄⁴ ≈ 1/n, whereas our planted vector satisfies ‖v₀‖₄⁴ ≥ 1/k ≫ 1/n, and this degree-4 information is leveraged by the SoS algorithms.

The polynomial f has a convenient matrix representation M = Σ_{i=1}^n (a_i a_i^⊤)^{⊗2}, where a₁, ..., a_n are the rows of the generator matrix A. It turns out that the eigenvalues of this matrix indeed give information about the planted sparse vector v₀. In particular, the vector x₀ ∈ R^d with Ax₀ = v₀ witnesses that M has an eigenvalue of at least 1/k, because M's quadratic form with the vector x₀^{⊗2} satisfies ⟨x₀^{⊗2}, M x₀^{⊗2}⟩ = ‖v₀‖₄⁴ ≥ 1/k. If we let M′ be the corresponding matrix for the subspace U without the planted sparse vector, M′ turns out to have only eigenvalues of at most O(1/n), up to a single spurious eigenvalue with eigenvector far from any vector of the form x ⊗ x [BBH+12].

It follows that in order to distinguish between a random subspace with a planted sparse vector (yes case) and a completely random subspace (no case), it is enough to compute the second-largest eigenvalue of a d²-by-d² matrix (representing the 4-norm polynomial over the subspace as above). This decision version of the problem, while strictly speaking easier than the search version above, is at the heart of the matter: one can show that the large eigenvalue in the yes case corresponds to an eigenvector which encodes the coefficients of the sparse planted vector in the basis.

Improvements.
The best running time we can hope for with this basic approach is $O(d^4)$ (the size of the matrix). Since we are interested in $d \leq O(\sqrt{n})$, the resulting running time $O(nd^2)$ would be subquadratic but still superlinear in the input size $n \cdot d$ (for representing a $d$-dimensional subspace of $\mathbb{R}^n$). To speed things up, we use the partial trace approach outlined above. We will begin by applying the partial trace approach naively, obtaining reasonable bounds, and then show that a small modification to the matrix before the partial trace operation allows us to achieve even smaller signal-to-noise ratios.

In the planted case, we may approximate $M \approx \frac{1}{k}(x_0 x_0^\top)^{\otimes 2} + Z$, where $x_0$ is the vector of coefficients of $v_0$ in the basis representation given by $A$ (so that $Ax_0 = v_0$), and $Z$ is the noise matrix. Since $\|x_0\| = 1$, the partial trace operation preserves the projector $(x_0 x_0^\top)^{\otimes 2}$, in the sense that $\mathrm{Tr}_{\mathbb{R}^d} (x_0 x_0^\top)^{\otimes 2} = x_0 x_0^\top$. Hence, with our heuristic approximation for $M$ above, we could show that the top eigenvector of $\mathrm{Tr}_{\mathbb{R}^d} M$ is close to $x_0$ by showing the spectral norm bound $\|\mathrm{Tr}_{\mathbb{R}^d} Z\| \leq o(1/k)$. The partial trace of our matrix $M = \sum_{i=1}^n (a_i a_i^\top) \otimes (a_i a_i^\top)$ is easy to compute directly:
\[
N = \mathrm{Tr}_{\mathbb{R}^d} M = \sum_{i=1}^n \|a_i\|_2^2 \cdot a_i a_i^\top.
\]
In the yes case (random subspace with planted sparse vector), a direct computation shows that
\[
\lambda_{\mathrm{yes}} \geq \langle x_0, N x_0 \rangle \approx \frac{d}{n}\Bigl(1 + \frac{n}{d}\|v_0\|_4^4\Bigr) \geq \frac{d}{n}\Bigl(1 + \frac{n}{dk}\Bigr).
\]
Hence, a natural approach to distinguish between the yes case and the no case (completely random subspace) is to upper bound the spectral norm of $N$ in the no case. In order to simplify the bound on the spectral norm of $N$ in the no case, suppose that the columns of $A$ are iid samples from the Gaussian distribution $\mathcal{N}(0, \frac{1}{n}\mathrm{Id})$ (rather than an orthogonal basis for the random subspace); Lemma 4.6 establishes that this simplification is legitimate.
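The identity $\mathrm{Tr}_{\mathbb{R}^d} M = \sum_i \|a_i\|_2^2 \, a_i a_i^\top$ is what makes the speedup possible: the statistic can be formed from the rows of $A$ in $O(nd^2)$ time without ever materializing the $d^2 \times d^2$ matrix $M$. A minimal sketch (function name ours):

```python
import numpy as np

def naive_partial_trace_stat(A):
    """N = sum_i ||a_i||^2 * a_i a_i^T, where a_i are the rows of A.

    This equals Tr_{R^d} M for M = sum_i (a_i a_i^T) (x) (a_i a_i^T),
    computed in O(n d^2) time without forming the d^2 x d^2 matrix M."""
    norms_sq = np.sum(A * A, axis=1)         # ||a_i||^2 for each row
    return (A * norms_sq[:, None]).T @ A     # sum_i ||a_i||^2 a_i a_i^T
```

For small instances one can verify this against the explicit partial trace of $M$ built from Kronecker products.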
In this simplified setup, the matrix $N$ in the no case is the sum of $n$ iid matrices $\{\|a_i\|^2 \cdot a_i a_i^\top\}$, and we can upper bound its spectral norm $\lambda_{\mathrm{no}}$ by $\frac{d}{n}(1 + O(\sqrt{d/n}))$ using standard matrix concentration bounds. Hence, using the spectral norm of $N$, we will be able to distinguish between the yes case and the no case as long as $\sqrt{d/n} \ll n/(dk)$, which ensures $\lambda_{\mathrm{no}} \ll \lambda_{\mathrm{yes}}$. For linear sparsity $k = \varepsilon \cdot n$, this inequality holds so long as $d \ll (n/\varepsilon^2)^{1/3}$, which is somewhat worse than the bound of $\sqrt{n}$ on the dimension that we are aiming for.

Recall that $\mathrm{Tr}\, B = \sum_i \lambda_i(B)$ for a symmetric matrix $B$. As discussed above, the partial trace approach works best when the noise behaves like the tensor of two Wigner matrices, in that there are cancellations when the eigenvalues of the noise are summed. In our case, the noise terms $(a_i a_i^\top) \otimes (a_i a_i^\top)$ do not have this property; in fact $\mathrm{Tr}\, a_i a_i^\top = \|a_i\|^2 \approx d/n$. Thus, in order to improve the dimension bound, we will center the eigenvalue distribution of the noise part of the matrix. This will cause it to behave more like a Wigner matrix, in that the spectral norm of the noise will not increase after a partial trace.

Consider the partial trace of a matrix of the form
\[
M - \alpha \cdot \mathrm{Id} \otimes \sum_i a_i a_i^\top,
\]
for some constant $\alpha > 0$. The partial trace of this matrix is
\[
N' = \sum_{i=1}^n \bigl(\|a_i\|_2^2 - \alpha\bigr) \cdot a_i a_i^\top.
\]
We choose the constant $\alpha \approx d/n$ such that our matrix $N'$ has expectation $0$ in the no case, when the subspace is completely random. In the yes case, the Rayleigh quotient of $N'$ at $x_0$ simply shifts as compared to $N$, and we have $\lambda_{\mathrm{yes}} \geq \langle x_0, N' x_0 \rangle \approx \|v_0\|_4^4 \geq 1/k$ (see Lemma 4.5 and sublemmas). On the other hand, in the no case, this centering operation causes significant cancellations in the eigenvalues of the partial trace matrix (instead of just shifting the eigenvalues).
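The centered statistic is just as cheap to compute as the naive one. The following sketch (function name and toy parameters ours) also sets up a small numerical comparison of the yes and no cases under the Gaussian-basis simplification; at these illustrative sizes the top eigenvalue in the yes case is on the order of $1/k$, while in the no case it is far smaller:

```python
import numpy as np

def centered_partial_trace_stat(A, alpha):
    """N' = sum_i (||a_i||^2 - alpha) * a_i a_i^T over the rows a_i of A.

    This is the partial trace of M - alpha * Id (x) (sum_i a_i a_i^T);
    choosing alpha ~ d/n centers the noise eigenvalues around zero."""
    norms_sq = np.sum(A * A, axis=1)
    return (A * (norms_sq - alpha)[:, None]).T @ A

# Toy comparison of the yes case (planted sparse vector) and the no case
# (fully random subspace); parameter choices are illustrative only.
rng = np.random.default_rng(0)
n, d, k = 2000, 10, 20
v0 = np.zeros(n)
v0[:k] = 1 / np.sqrt(k)                           # k-sparse unit vector
G = rng.standard_normal((n, d - 1)) / np.sqrt(n)  # columns ~ N(0, Id/n)
S_yes = np.column_stack([v0, G])                  # planted ("good") basis
S_no = rng.standard_normal((n, d)) / np.sqrt(n)   # completely random subspace
lam_yes = np.linalg.eigvalsh(centered_partial_trace_stat(S_yes, d / n))[-1]
lam_no = np.linalg.eigvalsh(centered_partial_trace_stat(S_no, d / n))[-1]
```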
In the no case, $N'$ has spectral norm $\lambda_{\mathrm{no}} \leq O(d/n^{3/2})$ for $d \ll \sqrt{n}$ (using standard matrix concentration bounds; again see Lemma 4.5 and sublemmas). Therefore, the spectral norm of the matrix $N'$ allows us to distinguish between the yes and no cases as long as $d/n^{3/2} \ll 1/k$, which is satisfied as long as $k \ll n$ and $d \ll \sqrt{n}$. We give the full formal argument in Section 4.

2.2 Overcomplete tensor decomposition

Recall that in this problem we are given a 3-tensor $\mathbf{T}$ of the form $\mathbf{T} = \sum_{i=1}^n a_i^{\otimes 3} \in \mathbb{R}^{d^3}$, where $a_1, \ldots, a_n \in \mathbb{R}^d$ are independent random vectors from $\mathcal{N}(0, \frac{1}{d}\mathrm{Id})$. The goal is to find a unit vector $a \in \mathbb{R}^d$ that is highly correlated with one⁴ of the vectors $a_1, \ldots, a_n$.

Background. The starting point of our algorithm is the polynomial $f(x) = \sum_{i=1}^n \langle a_i, x\rangle^3$. It turns out that for $n \ll d^{1.5}$ the (approximate) maximizers of this polynomial are close to the components $a_1, \ldots, a_n$, in the sense that $f(x) \approx 1$ if and only if $\max_{i \in [n]} \langle a_i, x\rangle^2 \approx 1$. Indeed, Ge and Ma [GM15] show that the sum-of-squares method already captures this fact at degree 12, which implies a quasipolynomial-time algorithm for this tensor decomposition problem via a general rounding result of Barak, Kelner, and Steurer [BKS15].

The simplest approach to this problem is to consider the tensor representation of the polynomial, $\mathbf{T} = \sum_{i \in [n]} a_i^{\otimes 3}$, and flatten it, hoping that the singular vectors of the flattening are correlated with the $a_i$. However, this approach is doomed to failure for two reasons: firstly, the simple flattenings of $\mathbf{T}$ are $d^2 \times d$ matrices, and since $n \gg d$ the $a_i^{\otimes 2}$ collide in the column space, so that it is impossible to determine $\mathrm{Span}\{a_i^{\otimes 2}\}$. Secondly, even for $n \leq d$, because the $a_i$ are random vectors, their norms concentrate very closely about 1.
This makes it difficult to distinguish any one particular $a_i$ even when the span is computable.

Improvements. We will try to circumvent both of these issues by going to higher dimensions. Suppose, for example, that we had access to $\sum_{i \in [n]} a_i^{\otimes 4}$.⁵ The eigenvectors of the flattenings of this matrix are all within $\mathrm{Span}_{i \in [n]}\{a_i^{\otimes 2}\}$, addressing our first issue and leaving us only with the trouble of extracting the individual $a_i^{\otimes 2}$ from their span. If furthermore we had access to $\sum_{i \in [n]} a_i^{\otimes 6}$, we could perform a partial random projection $(\Phi \otimes \mathrm{Id} \otimes \mathrm{Id}) \sum_{i \in [n]} a_i^{\otimes 6}$, where $\Phi \in \mathbb{R}^{d \times d}$ is a matrix with independent Gaussian entries; then, taking a partial trace, we end up with
\[
\mathrm{Tr}_{\mathbb{R}^d} (\Phi \otimes \mathrm{Id} \otimes \mathrm{Id}) \sum_{i \in [n]} a_i^{\otimes 6} = \sum_{i \in [n]} \langle \Phi, a_i^{\otimes 2} \rangle \, a_i^{\otimes 4}.
\]
With reasonable probability (for exposition's sake, say with probability $1/n^{10}$), $\Phi$ is closer to some $a_i^{\otimes 2}$ than to all of the others, so that $\langle \Phi, a_i^{\otimes 2}\rangle \geq 100\,\langle \Phi, a_j^{\otimes 2}\rangle$ for all $j \neq i$, and then $a_i^{\otimes 2}$ is distinguishable from the other vectors in the span of our matrix, taking care of the second issue. As we show, a much smaller gap is sufficient to distinguish the top $a_i$ from the other $a_j$, and so the higher-probability event that $\Phi$ is only slightly closer to $a_i$ suffices (allowing us to recover all vectors at an additional runtime cost of a factor of $\tilde O(n)$).

⁴ We can then approximately recover all the components $a_1, \ldots, a_n$ by running independent trials of our randomized algorithm repeatedly on the same input.
⁵ As the problem is defined, we assume that we do not have access to this input; in many machine learning applications this is a valid assumption, as gathering the data necessary to generate the 4th-order input tensor requires a prohibitively large number of samples.
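The projected partial trace above never requires forming the order-6 tensor: since $\langle \Phi, a_i^{\otimes 2}\rangle = a_i^\top \Phi\, a_i$, the resulting $d^2 \times d^2$ matrix can be assembled directly from the components. A sketch (function name ours; this is an illustration using the true components, which the algorithm does not actually have):

```python
import numpy as np

def projected_partial_trace(vectors, Phi):
    """Form sum_i <Phi, a_i (x) a_i> * (a_i a_i^T)^{(x)2} from the rows a_i,
    i.e. Tr_{R^d} (Phi (x) Id (x) Id) sum_i a_i^{(x)6}, without building
    the order-6 tensor.  Illustration only: the paper's algorithm works
    with the Kronecker square T^{(x)2} of the input tensor instead."""
    d = vectors.shape[1]
    weights = np.einsum('ij,jk,ik->i', vectors, Phi, vectors)  # a_i^T Phi a_i
    M = np.zeros((d * d, d * d))
    for w, a in zip(weights, vectors):
        aa = np.outer(a, a).ravel()       # a^{(x)2}, flattened
        M += w * np.outer(aa, aa)         # <Phi, a^{(x)2}> * (a a^T)^{(x)2}
    return M
```

For tiny $d$ one can check this against the explicit order-6 construction.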
This discussion ignores the presence of a single spurious large eigenvector, which we address in the technical sections.

Of course, we do not have access to the higher-order tensor $\sum_{i \in [n]} a_i^{\otimes 6}$. Instead, we can obtain a noisy version of this tensor. Our approach considers the following matrix representation of the polynomial $f^2$:
\[
M = \sum_{i,j} a_i a_j^\top \otimes (a_i a_i^\top) \otimes (a_j a_j^\top) \in \mathbb{R}^{d^3 \times d^3}.
\]
Alternatively, we can view this matrix as a particular flattening of the Kronecker-squared tensor $\mathbf{T}^{\otimes 2}$. It is instructive to decompose $M = M_{\mathrm{diag}} + M_{\mathrm{cross}}$ into its diagonal terms $M_{\mathrm{diag}} = \sum_i (a_i a_i^\top)^{\otimes 3}$ and its cross terms $M_{\mathrm{cross}} = \sum_{i \neq j} a_i a_j^\top \otimes (a_i a_i^\top) \otimes (a_j a_j^\top)$. The algorithm described above is already successful for $M_{\mathrm{diag}}$; we need only control the eigenvalues of the partial trace of the "noise" component, $M_{\mathrm{cross}}$. The main technical work will be to show that $\|\mathrm{Tr}_{\mathbb{R}^d} M_{\mathrm{cross}}\|$ is small. In fact, we will have to choose $\Phi$ from a somewhat different distribution: observing that $\mathrm{Tr}_{\mathbb{R}^d} (\Phi \otimes \mathrm{Id} \otimes \mathrm{Id}) M = \sum_{i,j} \langle a_i, \Phi a_j \rangle \cdot (a_i \otimes a_j)(a_i \otimes a_j)^\top$, we will sample $\Phi$ so that $\langle a_i, \Phi a_i \rangle \gg \langle a_i, \Phi a_j \rangle$. We give a more detailed overview of this algorithm in the beginning of Section 5, explaining in more detail our choice of $\Phi$ and justifying heuristically the boundedness of the spectral norm of the noise.

Connection to SoS analysis. To explain how the above algorithm is a speedup of SoS, we give an overview of the SoS algorithm of [GM15, BKS15]. There, the degree-$t$ SoS SDP is used to obtain an order-$t$ tensor $\chi_t$ (or a pseudodistribution). Informally speaking, we can understand $\chi_t$ as a proxy for $\sum_{i \in [n]} a_i^{\otimes t}$, so that $\chi_t = \sum_{i \in [n]} a_i^{\otimes t} + N$, where $N$ is a noise tensor. While the precise form of $N$ is unclear, we know that $N$ must obey a set of constraints imposed by the SoS hierarchy at degree $t$.
For a formal discussion of pseudodistributions, see [BKS15]. To extract a single component $a_i$ from the tensor $\sum_{i \in [n]} a_i^{\otimes t}$, there are many algorithms which would work (for example, the algorithm we described for $M_{\mathrm{diag}}$ above). However, any algorithm extracting an $a_i$ from $\chi_t$ must be robust to the noise tensor $N$. For this, it turns out the following algorithm will do: suppose we have the tensor $\sum_{i \in [n]} a_i^{\otimes t}$, taking $t = O(\log n)$. Sample $g_1, \ldots, g_{\log(n)-2}$ random unit vectors, and compute the matrix $M = \sum_i \bigl(\prod_{1 \leq j \leq \log(n)-2} \langle g_j, a_i \rangle\bigr) \cdot a_i a_i^\top$. If we are lucky enough, there is some $a_i$ so that every $g_j$ is a bit closer to $a_i$ than to any other $a_{i'}$, and $M = a_i a_i^\top + E$ for some $\|E\| \ll 1$. The proof that $\|E\|$ is small can be made so simple that it applies also to the SDP-produced proxy tensor $\chi_{\log n}$, and so this algorithm is robust to the noise $N$. This last step is very general and can handle tensors whose components $a_i$ are less well-behaved than the random vectors we consider, and also more overcomplete, handling tensors of rank up to $n = \tilde\Omega(d^{1.5})$.⁶

Our subquadratic-time algorithm can be viewed as a low-degree, spectral analogue of the [BKS15] SoS algorithm. However, rather than relying on an SDP to produce an object close to $\sum_{i \in [n]} a_i^{\otimes t}$, we manufacture one ourselves by taking the Kronecker square of our input tensor. We explicitly know the form of the deviation of $\mathbf{T}^{\otimes 2}$ from $\sum_{i \in [n]} a_i^{\otimes 6}$, unlike in [BKS15], where the deviation of the SDP certificate $\chi_t$ from $\sum_{i \in [n]} a_i^{\otimes t}$ is poorly understood. We are thus able to control this deviation (or "noise") in a less computationally intensive way, by cleverly designing a partial trace operation which decreases the spectral norm of the deviation. Since the tensor handled by the algorithm is much smaller (order 6 rather than order $\log n$), this provides the desired speedup.
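The random-contraction rounding step above can be sketched directly. In this illustration (function name ours) we cheat and weight the true components; the point of the SoS analysis is that the same step tolerates the noise tensor $N$:

```python
import numpy as np

def random_contraction_round(vectors, num_proj, rng):
    """Weight each component a_i by prod_j <g_j, a_i> for random unit
    vectors g_j, form M = sum_i w_i a_i a_i^T, and return its top
    eigenvector, hoping one component's weight dominates."""
    n, d = vectors.shape
    G = rng.standard_normal((num_proj, d))
    G /= np.linalg.norm(G, axis=1, keepdims=True)   # random unit vectors g_j
    weights = np.prod(G @ vectors.T, axis=0)        # prod_j <g_j, a_i>
    M = (vectors * weights[:, None]).T @ vectors    # sum_i w_i a_i a_i^T
    return np.linalg.eigh(M)[1][:, -1]              # top eigenvector
```

With orthonormal components the weighted matrix is exactly diagonal in the component basis, so the top eigenvector is a component exactly; the interesting regime is when the components are random and overcomplete, where the analysis must control the error term $E$.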
⁶ It is an interesting open question whether taking $t = O(\log n)$ is really necessary, or whether this heavy computational requirement is simply an artifact of the SoS proof.

2.3 Tensor principal component analysis

Recall that in this problem we are given a tensor $\mathbf{T} = \tau \cdot v^{\otimes 3} + \mathbf{A}$, where $v \in \mathbb{R}^d$ is a unit vector, $\mathbf{A}$ has iid entries from $\mathcal{N}(0,1)$, and $\tau \geq 0$ is the signal-to-noise ratio. The aim is to recover $v$ approximately.

Background and SoS analysis. A previous application of SoS techniques to this problem discussed several SoS or spectral algorithms, including one that runs in quasi-linear time [HSS15]. Here we apply the partial trace method to a subquadratic spectral SoS algorithm discussed in [HSS15] to achieve nearly the same signal-to-noise guarantee in only linear time.

Our starting point is the polynomial $\mathbf{T}(x) = \tau \cdot \langle v, x \rangle^3 + \langle x^{\otimes 3}, \mathbf{A} \rangle$. The maximizer of $\mathbf{T}(x)$ over the sphere is close to the vector $v$ so long as $\tau \gg \sqrt{d}$ [RM14]. In [HSS15], it was shown that degree-4 SoS maximizing this polynomial can recover $v$ with a signal-to-noise ratio of at least $\tilde\Omega(d^{3/4})$, since there exists a suitable SoS bound on the noise term $\langle x^{\otimes 3}, \mathbf{A} \rangle$. Specifically, let $A_i$ be the $i$-th slice of $\mathbf{A}$, so that $\langle x, A_i x \rangle$ is the quadratic form $\sum_{j,k} A_{ijk} x_j x_k$. Then there is an SoS proof that $\mathbf{T}(x)$ is bounded via $|\mathbf{T}(x) - \tau \cdot \langle v, x \rangle^3| \leq f(x)^{1/2} \cdot \|x\|$, where $f(x)$ is the degree-4 polynomial $f(x) = \sum_i \langle x, A_i x \rangle^2$. The polynomial $f$ has a convenient matrix representation, $f(x) = \langle x^{\otimes 2}, (\sum_i A_i \otimes A_i) x^{\otimes 2} \rangle$; since this matrix is a sum of iid random matrices $A_i \otimes A_i$, it is easy to show that it spectrally concentrates to its expectation.
So with high probability one can show that the eigenvalues of $\sum_i A_i \otimes A_i$ are at most $\approx d^{3/2}\log(d)^{1/2}$ (except for a single spurious eigenvector), and it follows that degree-4 SoS solves tensor PCA so long as $\tau \gg d^{3/4}\log(d)^{1/4}$.

This leads the authors to consider a slight modification of $f(x)$, given by $g(x) = \sum_i \langle x, T_i x \rangle^2$, where $T_i$ is the $i$-th slice of $\mathbf{T}$. Like $\mathbf{T}$, the function $g$ also contains information about $v$, and the SoS bound on the noise term in $\mathbf{T}$ carries over as an analogous bound on the noise in $g$. In particular, expanding $T_i \otimes T_i$ and ignoring some negligible cross-terms yields
\[
\sum_i T_i \otimes T_i \approx \tau^2 \cdot (v \otimes v)(v \otimes v)^\top + \sum_i A_i \otimes A_i.
\]
Using $v \otimes v$ as a test vector, the quadratic form of the latter matrix can be made at least $\tau^2 - O(d^{3/2}\log(d)^{1/2})$. Together with the boundedness of the eigenvalues of $\sum_i A_i \otimes A_i$, this shows that when $\tau \gg d^{3/4}\log(d)^{1/4}$ there is a spectral algorithm to recover $v$. Since the matrix $\sum_i T_i \otimes T_i$ is $d^2 \times d^2$, computing the top eigenvector requires $\tilde O(d^4)$ time; compared with the input size $d^3$, the algorithm runs in subquadratic time.

Improvements. In this work we speed this up to a linear-time algorithm via the partial trace approach. As we have seen, the heart of the matter is to show that taking the partial trace of $\tau^2 \cdot (v \otimes v)(v \otimes v)^\top + \sum_i A_i \otimes A_i$ does not increase the spectral noise. That is, we require that
\[
\Bigl\| \mathrm{Tr}_{\mathbb{R}^d} \sum_i A_i \otimes A_i \Bigr\| = \Bigl\| \sum_i \mathrm{Tr}(A_i) \cdot A_i \Bigr\| \leq O\bigl(d^{3/2}\log(d)^{1/2}\bigr).
\]
Notice that the $A_i$ are essentially Wigner matrices, and so it is roughly true that $|\mathrm{Tr}(A_i)| \approx \|A_i\|$, and the situation is very similar to our toy example of the application of partial traces in Section 2. Heuristically, because $\sum_i A_i \otimes A_i$ and $\sum_i \mathrm{Tr}(A_i) \cdot A_i$ are random matrices, we expect that their eigenvalues are all of roughly the same magnitude.
This means that their spectral norms should be close to their Frobenius norm divided by the square root of the dimension, since for a matrix $M$ with eigenvalues $\lambda_1, \ldots, \lambda_n$ we have $\|M\|_F = \bigl(\sum_{i \in [n]} \lambda_i^2\bigr)^{1/2}$. By estimating the sum of the squared entries, we expect the Frobenius norm of $\sum_i \mathrm{Tr}(A_i) \cdot A_i$ to be smaller than that of $\sum_i A_i \otimes A_i$ by a factor of $\sqrt{d}$ after the partial trace, while the dimension decreases by a factor of $d$; so, assuming that the eigenvalues are all of the same order, a typical eigenvalue should remain unchanged. We formalize these heuristic calculations using standard matrix concentration arguments in Section 6.

3 Preliminaries

Linear algebra. We will work in the real vector spaces given by $\mathbb{R}^n$. A vector of indeterminates may be denoted $x = (x_1, \ldots, x_n)$, although we may sometimes switch to parenthetical notation for indexing, i.e. $x = (x(1), \ldots, x(n))$, when subscripts are already in use. We denote by $[n]$ the set of all valid indices for a vector in $\mathbb{R}^n$. Let $e_i$ be the $i$-th canonical basis vector, so that $e_i(i) = 1$ and $e_i(j) = 0$ for $j \neq i$. For a vector space $V$, we may denote by $\mathcal{L}(V)$ the space of linear operators from $V$ to $V$. The space orthogonal to a vector $v$ is denoted $v^{\perp}$. For a matrix $M$, we use $M^{-1}$ to denote its inverse or its Moore–Penrose pseudoinverse; which one it is will be clear from context. For $M$ PSD, we write $M^{-1/2}$ for the unique PSD matrix with $(M^{-1/2})^2 = M^{-1}$.

Norms and inner products. We denote the usual entrywise inner product by $\langle \cdot, \cdot \rangle$, so that $\langle u, v \rangle = \sum_{i \in [n]} u_i v_i$ for $u, v \in \mathbb{R}^n$. The $\ell_p$-norm of a vector $v \in \mathbb{R}^n$ is given by $\|v\|_p = (\sum_{i \in [n]} |v_i|^p)^{1/p}$, with $\|v\|$ denoting the $\ell_2$-norm by default. The matrix norm used throughout the paper will be the operator/spectral norm, denoted $\|M\| = \|M\|_{\mathrm{op}} := \max_{x \neq 0} \|Mx\|/\|x\|$.

Tensor manipulation.
Boldface variables are reserved for tensors $\mathbf{T} \in \mathbb{R}^{n \times n \times n}$, of which we consider only order-3 tensors. We denote by $\mathbf{T}(x, y, z)$ the multilinear function in $x, y, z \in \mathbb{R}^n$ such that $\mathbf{T}(x, y, z) = \sum_{i,j,k \in [n]} \mathbf{T}_{i,j,k} \, x_i y_j z_k$, applying $x$, $y$, and $z$ to the first, second, and third modes of the tensor $\mathbf{T}$, respectively. If the arguments are matrices $P$, $Q$, and $R$ instead, this lifts $\mathbf{T}(P, Q, R)$ to the unique multilinear tensor-valued function such that $[\mathbf{T}(P, Q, R)](x, y, z) = \mathbf{T}(Px, Qy, Rz)$ for all vectors $x, y, z$.

Tensors may be flattened to matrices in the multilinear way such that for every $u \in \mathbb{R}^{n \times n}$ and $v \in \mathbb{R}^n$, the tensor $u \otimes v$ flattens to the matrix $u v^\top \in \mathbb{R}^{n^2 \times n}$, with $u$ considered as a vector. There are 3 different ways to flatten a 3-tensor $\mathbf{T}$, corresponding to the 3 modes of $\mathbf{T}$. Flattening may be understood as reinterpreting the indices of a tensor when the tensor is expressed as a 3-dimensional array of numbers. The expression $v^{\otimes 3}$ refers to $v \otimes v \otimes v$ for a vector $v$.

Probability and asymptotic bounds. We will often refer to collections of independent and identically distributed (or iid) random variables. The Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is denoted $\mathcal{N}(\mu, \sigma^2)$. Sometimes we state that an event happens with overwhelming probability (w.ov.p.); this means that its probability is at least $1 - n^{-\omega(1)}$. A function is $\tilde O(g(n))$ if it is $O(g(n))$ up to polylogarithmic factors.

4 Planted sparse vector in random linear subspace

In this section we give a nearly-linear-time algorithm to recover a sparse vector planted in a random subspace.

Problem 4.1. Let $v_0 \in \mathbb{R}^n$ be a unit vector such that $\|v_0\|_4^4 \geq \frac{1}{\varepsilon n}$. Let $v_1, \ldots, v_{d-1} \in \mathbb{R}^n$ be iid from $\mathcal{N}(0, \frac{1}{n}\mathrm{Id}_n)$. Let $w_0, \ldots, w_{d-1}$ be an orthogonal basis for $\mathrm{Span}\{v_0, \ldots, v_{d-1}\}$.
Given: $w_0, \ldots$
, $w_{d-1}$.
Find: a vector $v$ such that $\langle v, v_0 \rangle^2 \geq 1 - o(1)$.

Algorithm 4.2 (Sparse Vector Recovery in Nearly-Linear Time).
Input: $w_0, \ldots, w_{d-1}$ as in Problem 4.1.
Goal: Find $\hat v$ with $\langle \hat v, v_0 \rangle^2 \geq 1 - o(1)$.
• Compute the leverage scores $\|a_1\|^2, \ldots, \|a_n\|^2$, where $a_i$ is the $i$-th row of the $n \times d$ matrix $S := (w_0 \cdots w_{d-1})$.
• Compute the top eigenvector $u$ of the matrix
\[
A := \sum_{i \in [n]} \Bigl(\|a_i\|_2^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top.
\]
• Output $Su$.

Remark 4.3 (Implementation of Algorithm 4.2 in nearly-linear time). The leverage scores $\|a_1\|^2, \ldots, \|a_n\|^2$ are clearly computable in time $O(nd)$. In the course of proving correctness of the algorithm we will show that $A$ has constant spectral gap, so by a standard analysis $O(\log d)$ matrix-vector multiplies suffice to recover its top eigenvector. A single matrix-vector multiply $Ax$ requires computing $c_i := (\|a_i\|^2 - \frac{d}{n}) \langle a_i, x \rangle$ for each $i$ (in time $O(nd)$) and summing $\sum_{i \in [n]} c_i a_i$ (in time $O(nd)$). Finally, computing $Su$ requires summing $d$ vectors of dimension $n$, again taking time $O(nd)$.

The following theorem expresses correctness of the algorithm.

Theorem 4.4. Let $v_0 \in \mathbb{R}^n$ be a unit vector with $\|v_0\|_4^4 \geq \frac{1}{\varepsilon n}$. Let $v_1, \ldots, v_{d-1} \in \mathbb{R}^n$ be iid from $\mathcal{N}(0, \frac{1}{n}\mathrm{Id}_n)$. Let $w_0, \ldots, w_{d-1}$ be an orthogonal basis for $\mathrm{Span}\{v_0, \ldots, v_{d-1}\}$. Let $a_i$ be the $i$-th row of the $n \times d$ matrix $S := (w_0 \cdots w_{d-1})$. When $d \leq n^{1/2}/\mathrm{polylog}(n)$, for any sparsity $\varepsilon > 0$, w.ov.p. the top eigenvector $u$ of $\sum_{i=1}^n (\|a_i\|^2 - \frac{d}{n}) \cdot a_i a_i^\top$ satisfies $\langle Su, v_0 \rangle^2 \geq 1 - O(\varepsilon^{1/4}) - o(1)$.

We have little control over the basis vectors the algorithm is given. However, there is a particularly nice (albeit non-orthogonal) basis for the subspace which exposes the underlying randomness. Suppose that we are given the basis vectors $v_0, \ldots$
, $v_{d-1}$, where $v_0$ is the sparse vector normalized so that $\|v_0\| = 1$, and $v_1, \ldots, v_{d-1}$ are iid samples from $\mathcal{N}(0, \frac{1}{n}\mathrm{Id}_n)$. The following lemma shows that if the algorithm had been handed this good representation of the basis rather than an arbitrary orthogonal one, its output would be correlated with the vector of coefficients of the planted sparse vector (in this case the standard basis vector $e_1$).

Lemma 4.5. Let $v_0 \in \mathbb{R}^n$ be a unit vector. Let $v_1, \ldots, v_{d-1} \in \mathbb{R}^n$ be iid from $\mathcal{N}(0, \frac{1}{n}\mathrm{Id})$. Let $a_i$ be the $i$-th row of the $n \times d$ matrix $S := (v_0 \cdots v_{d-1})$. Then there is a universal constant $\varepsilon^* > 0$ so that for any $\varepsilon \leq \varepsilon^*$, so long as $d \leq n^{1/2}/\mathrm{polylog}(n)$, w.ov.p.
\[
\sum_{i=1}^n \Bigl(\|a_i\|^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top = \|v_0\|_4^4 \cdot e_1 e_1^\top + M,
\]
where $e_1$ is the first standard basis vector and $\|M\| \leq O(\|v_0\|_4^3 \cdot n^{-1/4} + \|v_0\|_4^2 \cdot n^{-1/2} + \|v_0\|_4 \cdot n^{-3/4} + n^{-1})$.

The second ingredient we need is that the algorithm is robust to exchanging this good basis for an arbitrary orthogonal basis.

Lemma 4.6. Let $v_0 \in \mathbb{R}^n$ have $\|v_0\|_4^4 \geq \frac{1}{\varepsilon n}$. Let $v_1, \ldots, v_{d-1} \in \mathbb{R}^n$ be iid from $\mathcal{N}(0, \frac{1}{n}\mathrm{Id}_n)$. Let $w_0, \ldots, w_{d-1}$ be an orthogonal basis for $\mathrm{Span}\{v_0, \ldots, v_{d-1}\}$. Let $a_i$ be the $i$-th row of the $n \times d$ matrix $S := (v_0 \cdots v_{d-1})$. Let $a'_i$ be the $i$-th row of the $n \times d$ matrix $S' := (w_0 \cdots w_{d-1})$. Let $A := \sum_i a_i a_i^\top$. Let $Q \in \mathbb{R}^{d \times d}$ be the orthogonal matrix with $S A^{-1/2} = S' Q$, which exists since $S A^{-1/2}$ is orthogonal, and which has the effect that $a'_i = Q A^{-1/2} a_i$. Then when $d \leq n^{1/2}/\mathrm{polylog}(n)$, w.ov.p.
\[
\Bigl\| \sum_{i=1}^n \Bigl(\|a'_i\|^2 - \frac{d}{n}\Bigr) \cdot a'_i a'^{\top}_i - Q \sum_{i=1}^n \Bigl(\|a_i\|^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top \, Q^\top \Bigr\| \leq O\Bigl(\frac{1}{n}\Bigr) + o(\|v_0\|_4^4).
\]

Last, we will need the following fact, which follows from standard concentration. The proof is in Section B.

Lemma 4.7. Let $v \in \mathbb{R}^n$ be a unit vector. Let $b_1, \ldots$
, $b_n \in \mathbb{R}^{d-1}$ be iid from $\mathcal{N}(0, \frac{1}{n}\mathrm{Id}_{d-1})$. Let $a_i \in \mathbb{R}^d$ be given by $a_i := (v(i), b_i)$. Then w.ov.p. $\|\sum_{i=1}^n a_i a_i^\top - \mathrm{Id}_d\| \leq \tilde O(d/n)^{1/2}$. In particular, when $d = o(n)$, this implies that w.ov.p. $\|(\sum_{i=1}^n a_i a_i^\top)^{-1} - \mathrm{Id}_d\| \leq \tilde O(d/n)^{1/2}$ and $\|(\sum_{i=1}^n a_i a_i^\top)^{-1/2} - \mathrm{Id}_d\| \leq \tilde O(d/n)^{1/2}$.

We are ready to prove Theorem 4.4.

Proof of Theorem 4.4. Let $b_1, \ldots, b_n$ be the rows of the matrix $S' := (v_0 \cdots v_{d-1})$. Let $B = \sum_i b_i b_i^\top$. Note that $S' B^{-1/2}$ has columns which form an orthogonal basis for $\mathrm{Span}\{w_0, \ldots, w_{d-1}\}$. Let $Q \in \mathbb{R}^{d \times d}$ be the rotation so that $S' B^{-1/2} = SQ$. By Lemma 4.5 and Lemma 4.6, we can write the matrix $A = \sum_{i=1}^n (\|a_i\|_2^2 - \frac{d}{n}) \cdot a_i a_i^\top$ as
\[
A = \|v_0\|_4^4 \cdot Q e_1 e_1^\top Q^\top + M,
\]
where w.ov.p. $\|M\| \leq O(\|v_0\|_4^3 \cdot n^{-1/4} + \|v_0\|_4^2 \cdot n^{-1/2} + \|v_0\|_4 \cdot n^{-3/4} + n^{-1}) + o(\|v_0\|_4^4)$. We have assumed that $\|v_0\|_4^4 \geq (\varepsilon n)^{-1}$, and so, since $A$ is an almost-rank-one matrix (Lemma A.3), the top eigenvector $u$ of $A$ has $\langle u, Q e_1 \rangle^2 \geq 1 - O(\varepsilon^{1/4})$, so that $\langle Su, SQ e_1 \rangle^2 \geq 1 - O(\varepsilon^{1/4})$ by column-orthogonality of $S$. At the same time, $SQ e_1 = S' B^{-1/2} e_1$, and by Lemma 4.7, $\|B^{-1/2} - \mathrm{Id}\| \leq \tilde O(d/n)^{1/2}$ w.ov.p., so that $\langle Su, S' e_1 \rangle^2 \geq \langle Su, SQ e_1 \rangle^2 - o(1)$. Finally, $S' e_1 = v_0$ by definition, so $\langle Su, v_0 \rangle^2 \geq 1 - O(\varepsilon^{1/4}) - o(1)$. □

4.1 Algorithm succeeds on good basis

We now prove Lemma 4.5. We decompose the matrix in question into a contribution from $\|v_0\|_4^4$ and the rest: explicitly, the decomposition is $\sum_i (\|a_i\|_2^2 - \frac{d}{n}) \cdot a_i a_i^\top = \sum_i v(i)^2 \cdot a_i a_i^\top + \sum_i (\|b_i\|_2^2 - \frac{d}{n}) \cdot a_i a_i^\top$. This first lemma handles the contribution from $\|v_0\|_4^4$.

Lemma 4.8. Let $v \in \mathbb{R}^n$ be a unit vector. Let $b_1, \ldots, b_n \in \mathbb{R}^{d-1}$ be random vectors iid from $\mathcal{N}(0, \frac{1}{n} \cdot \mathrm{Id}_{d-1})$.
Let $a_i = (v(i), b_i) \in \mathbb{R}^d$. Suppose $d \leq n^{1/2}/\mathrm{polylog}(n)$. Then
\[
\sum_{i=1}^n v(i)^2 \cdot a_i a_i^\top = \|v\|_4^4 \cdot e_1 e_1^\top + M',
\]
where $\|M'\| \leq O(\|v\|_4^3 n^{-1/4} + \|v\|_4^2 n^{-1/2})$ w.ov.p.

Proof of Lemma 4.8. We first show an operator-norm bound on the principal submatrix $\sum_{i=1}^n v(i)^2 \cdot b_i b_i^\top$ using the truncated matrix Bernstein inequality, Proposition A.7. First, the expected operator norm of each summand is bounded:
\[
\mathbb{E}\, v(i)^2 \|b_i\|_2^2 \leq \Bigl(\max_j v(j)^2\Bigr) \cdot O\Bigl(\frac{d}{n}\Bigr) \leq \|v\|_4^2 \cdot O\Bigl(\frac{d}{n}\Bigr).
\]
The operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.8 applies to truncate their tails in preparation for the application of a Bernstein bound. We just have to calculate the variance of the sum, which is at most
\[
\Bigl\| \mathbb{E} \sum_{i=1}^n v(i)^4 \|b_i\|_2^2 \cdot b_i b_i^\top \Bigr\| \leq \|v\|_4^4 \cdot O\Bigl(\frac{d}{n^2}\Bigr).
\]
The expectation $\mathbb{E} \sum_{i=1}^n v(i)^2 \cdot b_i b_i^\top$ is $\frac{\|v\|^2}{n} \cdot \mathrm{Id}$. Applying a matrix Bernstein bound (Proposition A.7) to the deviation from expectation, we get that w.ov.p.
\[
\Bigl\| \sum_{i=1}^n v(i)^2 \cdot b_i b_i^\top - \frac{1}{n} \cdot \mathrm{Id} \Bigr\| \leq \|v\|_4^2 \cdot \tilde O\Bigl(\frac{\sqrt{d}}{n}\Bigr) \leq O(\|v\|_4^2 n^{-1/2})
\]
for appropriate choice of $d \leq n^{1/2}/\mathrm{polylog}(n)$. Hence, by the triangle inequality, $\|\sum_{i=1}^n v(i)^2 \cdot b_i b_i^\top\| \leq O(\|v\|_4^2 n^{-1/2})$ w.ov.p.

Using a Cauchy–Schwarz-style inequality (Lemma A.1) we now show that this bound on the principal submatrix is essentially enough to obtain the lemma. Let $p_i, q_i \in \mathbb{R}^d$ be given by
\[
p_i := v(i) \cdot \begin{pmatrix} v(i) \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad q_i := v(i) \cdot \begin{pmatrix} 0 \\ b_i \end{pmatrix}.
\]
Then
\[
\sum_{i=1}^n v(i)^2 \cdot a_i a_i^\top = \|v\|_4^4 \cdot e_1 e_1^\top + \sum_{i=1}^n \bigl( p_i q_i^\top + q_i p_i^\top + q_i q_i^\top \bigr).
\]
We have already bounded $\|\sum_{i=1}^n q_i q_i^\top\| = \|\sum_{i=1}^n v(i)^2 \cdot b_i b_i^\top\|$. At the same time, $\|\sum_{i=1}^n p_i p_i^\top\| = \|v\|_4^4$. By Lemma A.1, then,
\[
\Bigl\| \sum_{i=1}^n p_i q_i^\top + q_i p_i^\top \Bigr\| \leq O(\|v\|_4^3 n^{-1/4}) \quad \text{w.ov.p.}
\]
A final application of the triangle inequality gives the lemma. □
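The concentration in the proof above is easy to observe numerically. The following sanity check (not a proof; parameters and the slack constant 10 are ours) verifies that for a $k$-sparse unit $v$ and $b_i \sim \mathcal{N}(0, \mathrm{Id}/n)$, the matrix $\sum_i v(i)^2 b_i b_i^\top$ lies within roughly $\|v\|_4^2 \, n^{-1/2}$ of $\frac{1}{n}\mathrm{Id}$ in operator norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 20, 20
v = np.zeros(n)
v[:k] = 1 / np.sqrt(k)                            # k-sparse unit vector
B = rng.standard_normal((n, d - 1)) / np.sqrt(n)  # rows b_i ~ N(0, Id/n)
S = (B * (v ** 2)[:, None]).T @ B                 # sum_i v(i)^2 b_i b_i^T
dev = np.linalg.norm(S - np.eye(d - 1) / n, 2)    # operator-norm deviation
v4_sq = np.sqrt(np.sum(v ** 4))                   # ||v||_4^2
assert dev <= 10 * v4_sq / np.sqrt(n)             # generous slack constant
```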
Our second lemma controls the contribution from the random part of the leverage scores.

Lemma 4.9. Let $v \in \mathbb{R}^n$ be a unit vector. Let $b_1, \ldots, b_n \in \mathbb{R}^{d-1}$ be random vectors iid from $\mathcal{N}(0, \frac{1}{n} \cdot \mathrm{Id}_{d-1})$. Let $a_i = (v(i), b_i) \in \mathbb{R}^d$. Suppose $d \leq n^{1/2}/\mathrm{polylog}(n)$. Then w.ov.p.
\[
\Bigl\| \sum_{i=1}^n \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top \Bigr\| \leq \|v\|_4^2 \cdot O(n^{-3/4}) + \|v\|_4 \cdot O(n^{-1}) + O(n^{-1}).
\]

Proof. As in the proof of Lemma 4.8, $\sum_{i=1}^n (\|b_i\|_2^2 - \frac{d}{n}) \cdot a_i a_i^\top$ decomposes into a convenient block structure; we will bound each block separately:
\[
\sum_{i=1}^n \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top = \sum_{i=1}^n \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) \cdot \begin{pmatrix} v(i)^2 & v(i) \cdot b_i^\top \\ v(i) \cdot b_i & b_i b_i^\top \end{pmatrix}. \tag{4.1}
\]
In each block we can apply a (truncated) Bernstein inequality. For the large block $\sum_{i=1}^n (\|b_i\|_2^2 - \frac{d}{n}) b_i b_i^\top$, the centering by $\frac{d}{n}$ ensures that $\mathbb{E}\,(\|b_i\|_2^2 - \frac{d}{n}) b_i b_i^\top = O(\frac{1}{n^2}) \cdot \mathrm{Id}$. The expected operator norm of each summand is small:
\[
\mathbb{E}\,\Bigl\| \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) b_i b_i^\top \Bigr\| = \mathbb{E}\,\Bigl| \|b_i\|_2^2 - \frac{d}{n} \Bigr| \, \|b_i\|_2^2 \leq \Bigl( \mathbb{E}\,\bigl(\|b_i\|_2^2 - \tfrac{d}{n}\bigr)^2 \Bigr)^{1/2} \bigl( \mathbb{E}\,\|b_i\|_2^4 \bigr)^{1/2} \leq O\Bigl(\frac{d^{1/2}}{n}\Bigr) \cdot O\Bigl(\frac{d}{n}\Bigr) = O\Bigl(\frac{d^{3/2}}{n^2}\Bigr),
\]
where the first inequality is Cauchy–Schwarz and the second uses that the variance of a $\chi^2$ with $k$ degrees of freedom is $O(k)$. The termwise operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.8 applies to truncate the tails of the summands in preparation for a Bernstein bound. We just have to compute the variance of the sum, which is small because we have centered the coefficients:
\[
\Bigl\| \sum_i \mathbb{E}\,\Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr)^2 \|b_i\|_2^2 \cdot b_i b_i^\top \Bigr\| \leq O\Bigl(\frac{d^2}{n^3}\Bigr),
\]
by direct computation of $\mathbb{E}\,(\|b_i\|_2^2 - \frac{d}{n})^2 \|b_i\|_2^2 \, b_i b_i^\top$ using Fact A.6. These facts together are enough to apply the matrix Bernstein inequality (Proposition A.7) and conclude that w.ov.p.
\[
\Bigl\| \sum_{i=1}^n \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) \cdot b_i b_i^\top \Bigr\| \leq \tilde O\Bigl(\frac{d}{n^{3/2}}\Bigr) \leq O\Bigl(\frac{1}{n}\Bigr)
\]
for appropriate choice of $d \leq n^{1/2}/\mathrm{polylog}(n)$.
We turn to the other blocks from (4.1). The upper-left block contains just the scalar $\sum_{i=1}^n (\|b_i\|_2^2 - \frac{d}{n}) v(i)^2$. By standard concentration each term is bounded: w.ov.p.,
\[
\Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) v(i)^2 \leq \Bigl(\max_i v(i)^2\Bigr) \cdot \tilde O\Bigl(\frac{d^{1/2}}{n}\Bigr) \leq \|v\|_4^2 \cdot \tilde O\Bigl(\frac{d^{1/2}}{n}\Bigr).
\]
The sum has variance at most $\sum_{i=1}^n v(i)^4 \, \mathbb{E}\,(\|b_i\|_2^2 - \frac{d}{n})^2 \leq \|v\|_4^4 \cdot O(d/n^2)$. Again using Lemma A.8 and Proposition A.7, we get that w.ov.p.
\[
\Bigl| \sum_{i=1}^n \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) v(i)^2 \Bigr| \leq \|v\|_4^2 \cdot \tilde O\Bigl(\frac{d^{1/2}}{n}\Bigr).
\]
It remains just to address the block $\sum_{i=1}^n (\|b_i\|_2^2 - \frac{d}{n}) v(i) \cdot b_i$. Each term in the sum has expected norm at most
\[
\Bigl(\max_i v(i)^2\Bigr)^{1/2} \cdot O\Bigl(\frac{d}{n^{3/2}}\Bigr) \leq \|v\|_4 \cdot O\Bigl(\frac{d}{n^{3/2}}\Bigr),
\]
and once again, since the summands' operator norms are bounded by constant-degree polynomials of Gaussian variables, Lemma A.8 applies to truncate their tails in preparation for a Bernstein bound. The variance of the sum is at most $\|v\|_2^2 \cdot O(d^2/n^3)$, again by Fact A.6. Finally, Lemma A.8 and Proposition A.7 apply to give that w.ov.p.
\[
\Bigl\| \sum_{i=1}^n \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) v(i) \cdot b_i \Bigr\| \leq \|v\|_4 \cdot \tilde O\Bigl(\frac{d}{n^{3/2}}\Bigr) + \tilde O\Bigl(\frac{d}{n^{3/2}}\Bigr) \leq \|v\|_4 \cdot n^{-1} + n^{-1}
\]
for appropriate choice of $d \leq n^{1/2}/\mathrm{polylog}(n)$. Putting it all together gives the lemma. □

We are now ready to prove Lemma 4.5.

Proof of Lemma 4.5. We decompose $\|a_i\|_2^2 = v_0(i)^2 + \|b_i\|_2^2$ and use Lemma 4.8 and Lemma 4.9:
\[
\sum_{i=1}^n \Bigl(\|a_i\|_2^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top = \sum_{i=1}^n v_0(i)^2 \cdot a_i a_i^\top + \sum_{i=1}^n \Bigl(\|b_i\|_2^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top = \|v_0\|_4^4 \cdot e_1 e_1^\top + M,
\]
where $\|M\| \leq O(\|v_0\|_4^3 \cdot n^{-1/4} + \|v_0\|_4^2 \cdot n^{-1/2}) + O(\|v_0\|_4 \cdot n^{-1} + n^{-1})$. Since $\|v_0\|_4^4 \geq (\varepsilon n)^{-1}$, we get $\|v_0\|_4^4 / \|M\| \geq \frac{1}{\varepsilon^{1/4}}$, completing the proof. □

4.2 Closeness of input basis and good basis

We turn now to the proof of Lemma 4.6. We recall the setting.
We have two matrices: $M$, which the algorithm computes, and $M'$, which is induced by a basis for the subspace which reveals the underlying randomness and which we prefer for the analysis. $M'$ differs from $M$ by a rotation and a basis orthogonalization step (the good basis is only almost orthogonal). The rotation is easily handled. The following lemma gives the critical fact about the orthogonalization step: orthogonalizing does not change the leverage scores too much.⁷

⁷ Strictly speaking, the good basis does not have leverage scores since it is not orthogonal, but we can still talk about the norms of the rows of the matrix whose columns are the basis vectors.

Lemma 4.10 (Restatement of Lemma B.4). Let $v \in \mathbb{R}^n$ be a unit vector and let $b_1, \ldots, b_n \in \mathbb{R}^{d-1}$ be iid from $\mathcal{N}(0, \frac{1}{n}\mathrm{Id}_{d-1})$. Let $a_i \in \mathbb{R}^d$ be given by $a_i := (v(i), b_i)$. Let $A := \sum_i a_i a_i^\top$. Let $c \in \mathbb{R}^{d-1}$ be given by $c := \sum_i v(i) b_i$. Then for every index $i \in [n]$, w.ov.p.,
\[
\bigl| \|A^{-1/2} a_i\|^2 - \|a_i\|^2 \bigr| \leq \tilde O\Bigl(\frac{d + \sqrt{n}}{n}\Bigr) \cdot \|a_i\|^2.
\]
The proof again uses standard concentration and matrix inversion formulas, and can be found in Section B. We are ready to prove Lemma 4.6.

Proof of Lemma 4.6. The statement we want to show is
\[
\Bigl\| \sum_{i=1}^n \Bigl(\|a'_i\|^2 - \frac{d}{n}\Bigr) \cdot a'_i a'^{\top}_i - Q \sum_{i=1}^n \Bigl(\|a_i\|^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top \, Q^\top \Bigr\| \leq O\Bigl(\frac{1}{n}\Bigr) + o(\|v\|_4^4).
\]
Conjugating by $Q$ and multiplying by $-1$ does not change the operator norm, so this is equivalent to
\[
\Bigl\| \sum_{i=1}^n \Bigl(\|a_i\|^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top - Q^\top \sum_{i=1}^n \Bigl(\|a'_i\|^2 - \frac{d}{n}\Bigr) \cdot a'_i a'^{\top}_i \, Q \Bigr\| \leq O\Bigl(\frac{1}{n}\Bigr) + o(\|v\|_4^4).
\]
Finally, substituting $a'_i = Q A^{-1/2} a_i$ and using the fact that $Q$ is a rotation, it will be enough to show
\[
\Bigl\| \sum_{i=1}^n \Bigl(\|a_i\|^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top - A^{-1/2} \sum_{i=1}^n \Bigl(\|A^{-1/2} a_i\|^2 - \frac{d}{n}\Bigr) \cdot a_i a_i^\top \, A^{-1/2} \Bigr\| \leq O\Bigl(\frac{1}{n}\Bigr) + o(\|v\|_4^4).
\]
(4.2)

We write the right-hand matrix as
$$A^{-1/2}\sum_{i=1}^n\Bigl(\|A^{-1/2}a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top\,A^{-1/2} \;=\; A^{-1/2}\sum_{i=1}^n\Bigl(\|A^{-1/2}a_i\|^2-\|a_i\|^2\Bigr)\cdot a_ia_i^\top\,A^{-1/2}+A^{-1/2}\sum_{i=1}^n\Bigl(\|a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top\,A^{-1/2}\,.$$
The first of these, we observe, has bounded operator norm w.ov.p.:
$$\Bigl\|A^{-1/2}\sum_{i=1}^n\Bigl(\|A^{-1/2}a_i\|^2-\|a_i\|^2\Bigr)\cdot a_ia_i^\top\,A^{-1/2}\Bigr\| \;\le\; \Bigl\|A^{-1/2}\sum_{i=1}^n\bigl|\|A^{-1/2}a_i\|^2-\|a_i\|^2\bigr|\cdot a_ia_i^\top\,A^{-1/2}\Bigr\| \;\le\; \tilde O\Bigl(\frac{d+\sqrt n}{n}\Bigr)\cdot\Bigl\|\sum_{i=1}^n\|a_i\|^2\cdot a_ia_i^\top\Bigr\|\,,$$
where we have used Lemma 4.7 to find that $A^{1/2}$ is close to the identity, and Lemma 4.10 to simplify the summands. Continuing,
$$\le\; \tilde O\Bigl(\frac{d+\sqrt n}{n}\Bigr)\cdot\Bigl(\Bigl\|\sum_{i=1}^n v_0(i)^2\cdot a_ia_i^\top\Bigr\|+\Bigl\|\sum_{i=1}^n\|b_i\|_2^2\cdot a_ia_i^\top\Bigr\|\Bigr) \;\le\; \tilde O\Bigl(\frac{d+\sqrt n}{n}\Bigr)\cdot\Bigl(O(\|v\|_4^4)+\tilde O\Bigl(\frac dn\Bigr)\Bigr)\,,$$
using in the last step Lemma 4.8 and standard concentration to bound $\|\sum_{i=1}^n\|b_i\|_2^2\cdot a_ia_i^\top\|$ (Lemma 4.7). Thus, by the triangle inequality applied to (4.2), we get
$$\Bigl\|\sum_{i=1}^n\Bigl(\|a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top-A^{-1/2}\sum_{i=1}^n\Bigl(\|A^{-1/2}a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top\,A^{-1/2}\Bigr\| \;\le\; \tilde O\Bigl(\frac{d+\sqrt n}{n}\Bigr)\cdot\Bigl(O(\|v\|_4^4)+\tilde O\Bigl(\frac dn\Bigr)\Bigr)+\Bigl\|\sum_{i=1}^n\Bigl(\|a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top-A^{-1/2}\sum_{i=1}^n\Bigl(\|a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top\,A^{-1/2}\Bigr\|\,.$$
Finally, since w.ov.p. $\|A^{-1/2}-\mathrm{Id}\|=\tilde O(d/n)^{1/2}$, we get
$$\Bigl\|\sum_{i=1}^n\Bigl(\|a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top-A^{-1/2}\sum_{i=1}^n\Bigl(\|A^{-1/2}a_i\|^2-\tfrac dn\Bigr)\cdot a_ia_i^\top\,A^{-1/2}\Bigr\| \;\le\; \tilde O\Bigl(\frac{d+\sqrt n}{n}\Bigr)\cdot\Bigl(O(\|v\|_4^4)+\tilde O\Bigl(\frac dn\Bigr)\Bigr)+\tilde O\Bigl(\frac dn\Bigr)^{1/2}\cdot\Bigl\|\sum_{i=1}^n\Bigl(\|a_i\|_2^2-\tfrac dn\Bigr)\cdot a_ia_i^\top\Bigr\| \;\le\; \tilde O\Bigl(\frac{d+\sqrt n}{n}\Bigr)\cdot\Bigl(O(\|v\|_4^4)+\tilde O\Bigl(\frac dn\Bigr)\Bigr)+\tilde O\Bigl(\frac dn\Bigr)^{1/2}\cdot O(\|v\|_4^4)\,,$$
using Lemma 4.5 in the last step. For an appropriate choice of $d\le n^{1/2}/\mathrm{polylog}(n)$, this is at most $O(n^{-1})+o(\|v\|_4^4)$.

5 Overcomplete tensor decomposition

In this section, we give a polynomial-time algorithm for the following problem when $n\le d^{4/3}/\mathrm{polylog}(d)$:

Problem 5.1.
Given an order-3 tensor $\mathbf T=\sum_{i=1}^na_i\otimes a_i\otimes a_i$, where $a_1,\dots,a_n\in\mathbb R^d$ are iid vectors sampled from $\mathcal N(0,\frac1d\mathrm{Id})$, find vectors $b_1,\dots,b_n\in\mathbb R^d$ such that for all $i\in[n]$, $\langle a_i,b_i\rangle\ge1-o(1)$.

We give an algorithm that solves this problem, so long as the overcompleteness of the input tensor is bounded such that $n\ll d^{4/3}/\mathrm{polylog}(d)$.

Theorem 5.2. Given as input the tensor $\mathbf T=\sum_{i=1}^na_i\otimes a_i\otimes a_i$ where $a_i\sim\mathcal N(0,\frac1d\mathrm{Id}_d)$ with $d\le n\le d^{4/3}/\mathrm{polylog}(d)$,⁸ there is an algorithm, running in time $\tilde O(nd^{1+\omega})$ or $\tilde O(nd^{3.257})$, where $d^\omega$ is the time to multiply two $d\times d$ matrices, which with probability $1-o(1)$ over the input $\mathbf T$ and the randomness of the algorithm finds unit vectors $b_1,\dots,b_n\in\mathbb R^d$ such that for all $i\in[n]$,
$$\langle a_i,b_i\rangle\;\ge\;1-\tilde O\Bigl(\frac{n^{3/2}}{d^2}\Bigr)\,.$$

⁸ The lower bound $d\le n$ is a matter of technical convenience, avoiding separate concentration analyses and arithmetic in the undercomplete ($n<d$) and overcomplete ($n\ge d$) settings. Indeed, our algorithm still works in the undercomplete setting (tensor decomposition is easier in the undercomplete setting than in the overcomplete one), but here other algorithms based on local search also work [AGJ15].

We remark that this accuracy can be improved from $1-\tilde O(n^{3/2}/d^2)$ to arbitrarily good precision using existing local search methods with local convergence guarantees; see Corollary 5.23.

As discussed in Section 2, to decompose the tensor $\sum_ia_i^{\otimes6}$ (note we do not actually have access to this input!) there is a very simple tensor decomposition algorithm: sample a random $g\in\mathbb R^{d^2}$ and compute the matrix $\sum_i\langle g,a_i^{\otimes2}\rangle\,(a_ia_i^\top)^{\otimes2}$. With probability roughly $n^{-O(\varepsilon)}$ this matrix has (up to scaling) the form $(a_ia_i^\top)^{\otimes2}+E$ for some $\|E\|\le1-\varepsilon$, and this is enough to recover $a_i$.
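The idealized routine just described can be simulated directly at toy scale. The sketch below (all sizes are hypothetical choices) builds $\sum_i\langle g,a_i^{\otimes2}\rangle(a_ia_i^\top)^{\otimes2}$ for one random $g$ and reports the correlation of its top eigenvector with the best-aligned $a_i^{\otimes2}$; a single attempt succeeds only with modest probability, so the printed value is often, but not always, close to 1:

```python
import numpy as np

# Toy simulation of the idealized 6th-order decomposition step described
# above. Sizes are hypothetical illustrative choices.
rng = np.random.default_rng(1)
d, n = 10, 15
A = rng.normal(scale=1/np.sqrt(d), size=(n, d))          # a_i ~ N(0, Id/d)

AA = np.einsum('ip,iq->ipq', A, A).reshape(n, d*d)       # rows a_i (x) a_i
g = rng.normal(size=d*d)
coeffs = AA @ g                                           # <g, a_i^{(x)2}>
M = AA.T @ (coeffs[:, None] * AA)          # sum_i <g,a_i^{(x)2}> (a_i^{(x)2})(a_i^{(x)2})^T

vals, vecs = np.linalg.eigh(M)
u = vecs[:, np.argmax(np.abs(vals))]                      # top eigenvector of M
corr = np.max(np.abs(AA @ u) / (np.linalg.norm(AA, axis=1) * np.linalg.norm(u)))
print(corr)   # often (not always) close to 1 for one of the components
```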
However, instead of $\sum_ia_i^{\otimes6}$, we have only $\sum_{i,j}(a_i\otimes a_j)^{\otimes3}$. Unfortunately, running the same algorithm on the latter input will not succeed. To see why, consider the extra terms $E':=\sum_{i\ne j}\langle g,a_i\otimes a_j\rangle(a_i\otimes a_j)^{\otimes2}$. Since $|\langle g,a_i\otimes a_j\rangle|\approx1$, it is straightforward to see that $\|E'\|_F\approx n$. Since the rank of $E'$ is at most $d^2$, even if we are lucky and all the eigenvalues have similar magnitudes, still a typical eigenvalue will be $\approx n/d\gg1$, swallowing the $\sum_ia_i^{\otimes6}$ term.

A convenient feature separating the signal terms $\sum_i(a_i\otimes a_i)^{\otimes3}$ from the cross terms $\sum_{i\ne j}(a_i\otimes a_j)^{\otimes3}$ is that the cross terms are not within the span of the $a_i\otimes a_i$. Although we cannot algorithmically access $\mathrm{Span}\{a_i\otimes a_i\}$, we have access to something almost as good: the unfolded input tensor, $T=\sum_{i\in[n]}a_i(a_i\otimes a_i)^\top$. The rows of this matrix lie in $\mathrm{Span}\{a_i\otimes a_i\}$, and so for $i\ne j$, $\|T(a_i\otimes a_i)\|\gg\|T(a_i\otimes a_j)\|$. In fact, careful computation reveals that $\|T(a_i\otimes a_i)\|\ge\tilde\Omega(d/\sqrt n)\cdot\|T(a_i\otimes a_j)\|$.

The idea now is to replace $\sum_{i,j}\langle g,a_i\otimes a_j\rangle(a_i\otimes a_j)^{\otimes2}$ with $\sum_{i,j}\langle g,T(a_i\otimes a_j)\rangle(a_i\otimes a_j)^{\otimes2}$, now with $g\sim\mathcal N(0,\mathrm{Id}_d)$. As before, we are hoping that there is $i_0$ so that $\langle g,T(a_{i_0}\otimes a_{i_0})\rangle\gg\max_{j\ne i_0}\langle g,T(a_j\otimes a_j)\rangle$. But now we also require
$$\Bigl\|\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle(a_i\otimes a_j)(a_i\otimes a_j)^\top\Bigr\| \;\ll\; \langle g,T(a_{i_0}\otimes a_{i_0})\rangle \;\approx\; \|T(a_{i_0}\otimes a_{i_0})\|\,.$$
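The norm separation provided by the unfolding can be observed empirically. A minimal sketch (sizes are arbitrary illustrative values) compares the "diagonal" image $T(a_i\otimes a_i)$ against a "cross" image $T(a_i\otimes a_j)$:

```python
import numpy as np

# Compare diagonal and cross images of the unfolding
# T = sum_l a_l (a_l (x) a_l)^T. The diagonal vector T(a_0 (x) a_0) should
# be noticeably longer than the cross vector T(a_0 (x) a_1).
rng = np.random.default_rng(3)
d, n = 100, 100
A = rng.normal(scale=1/np.sqrt(d), size=(n, d))        # a_i ~ N(0, Id/d)

AA = np.einsum('ip,iq->ipq', A, A).reshape(n, d*d)     # rows a_i (x) a_i
T = A.T @ AA                                            # d x d^2 unfolding

diag = np.linalg.norm(T @ AA[0])                        # ||T(a_0 (x) a_0)||
cross = np.linalg.norm(T @ np.kron(A[0], A[1]))         # ||T(a_0 (x) a_1)||
print(diag / cross)   # roughly on the order of d / sqrt(n), up to logs
```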
If we are lucky and all the eigenvalues of this cross-term matrix have roughly the same magnitude (indeed, we will be lucky in this way), then we can estimate heuristically that
$$\Bigl\|\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle(a_i\otimes a_j)(a_i\otimes a_j)^\top\Bigr\| \;\approx\; \frac1d\Bigl\|\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle(a_i\otimes a_j)(a_i\otimes a_j)^\top\Bigr\|_F \;\le\; \frac1d\cdot\frac{\sqrt n}{d}\,\bigl|\langle g,T(a_{i_0}\otimes a_{i_0})\rangle\bigr|\cdot\Bigl\|\sum_{i\ne j}(a_i\otimes a_j)(a_i\otimes a_j)^\top\Bigr\|_F \;\le\; \frac{n^{3/2}}{d^2}\,\bigl|\langle g,T(a_{i_0}\otimes a_{i_0})\rangle\bigr|\,,$$
suggesting our algorithm will succeed when $n^{3/2}\ll d^2$, which is to say $n\ll d^{4/3}$.

The following theorem, which formalizes the intuition above, is at the heart of our tensor decomposition algorithm.

Theorem 5.3. Let $a_1,\dots,a_n$ be independent random vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$ with $d\le n\le d^{4/3}/\mathrm{polylog}(d)$ and let $g$ be a random vector from $\mathcal N(0,\mathrm{Id}_d)$. Let $\Sigma:=\mathbb E_{x\sim\mathcal N(0,\mathrm{Id}_d)}(xx^\top)^{\otimes2}$ and let $R:=\sqrt2\cdot(\Sigma^+)^{1/2}$. Let $T=\sum_{i\in[n]}a_i(a_i\otimes a_i)^\top$. Define the matrix $M\in\mathbb R^{d^2\times d^2}$,
$$M \;=\; \sum_{i,j\in[n]}\langle g,T(a_i\otimes a_j)\rangle\cdot(a_i\otimes a_j)(a_i\otimes a_j)^\top\,.$$
With probability $1-o(1)$ over the choice of $a_1,\dots,a_n$, for every $\mathrm{polylog}(d)/\sqrt d<\varepsilon<1$, the spectral gap of $RMR$ satisfies $\lambda_2/\lambda_1\le1-O(\varepsilon)$, and the top eigenvector $u\in\mathbb R^{d^2}$ of $RMR$ satisfies, with probability $\tilde\Omega(1/n^{O(\varepsilon)})$ over the choice of $g$,
$$\max_{i\in[n]}\;\frac{\langle Ru,a_i\otimes a_i\rangle^2}{\|u\|^2\cdot\|a_i\|^4} \;\ge\; 1-\tilde O\Bigl(\frac{n^{3/2}}{\varepsilon d^2}\Bigr)\,.$$
Moreover, with probability $1-o(1)$ over the choice of $a_1,\dots,a_n$, for every $\mathrm{polylog}(d)/\sqrt d<\varepsilon<1$ there are events $E_1,\dots,E_n$ so that $\mathbb P_g(E_i)\ge\tilde\Omega(1/n^{1+O(\varepsilon)})$ for all $i\in[n]$, and when $E_i$ occurs,
$$\frac{\langle Ru,a_i\otimes a_i\rangle^2}{\|u\|^2\cdot\|a_i\|^4} \;\ge\; 1-\tilde O\Bigl(\frac{n^{3/2}}{\varepsilon d^2}\Bigr)\,.$$
We will eventually set $\varepsilon=1/\log n$, which gives us a spectral algorithm for recovering a vector $\bigl(1-\tilde O(n^{3/2}/d^2)\bigr)$-correlated with some $a_i^{\otimes2}$.
Once we have a vector correlated with each $a_i^{\otimes2}$, obtaining vectors close to the $a_i$ is straightforward. We begin by proving this theorem, deferring the algorithmic details to Section 5.4.

The rest of this section is organized as follows. In Section 5.1 we prove Theorem 5.3 using two core facts: the Gaussian vector $g$ is closer to some $a_i$ than to any other with good probability, and the noise term $\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle(a_i\otimes a_j)(a_i\otimes a_j)^\top$ is bounded in spectral norm. In Section 5.2 we prove the first of these two facts, and in Section 5.3 we prove the second. In Section 5.4 we give the full details of our tensor decomposition algorithm, then prove Theorem 5.2 using Theorem 5.3. Finally, Section C contains proofs of elementary or long-winded lemmas we use along the way.

5.1 Proof of Theorem 5.3

The strategy to prove Theorem 5.3 is to decompose the matrix $M$ into two parts, $M=M_{\mathrm{diag}}+M_{\mathrm{cross}}$: one formed by diagonal terms, $M_{\mathrm{diag}}=\sum_{i\in[n]}\langle g,T(a_i\otimes a_i)\rangle\cdot(a_i\otimes a_i)(a_i\otimes a_i)^\top$, and one formed by cross terms, $M_{\mathrm{cross}}=\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle\cdot(a_i\otimes a_j)(a_i\otimes a_j)^\top$. We will use the fact that the top eigenvector of $M_{\mathrm{diag}}$ is likely to be correlated with one of the vectors $a_j^{\otimes2}$, and also the fact that the spectral gap of $M_{\mathrm{diag}}$ is noticeable. The following two propositions capture the relevant facts about the spectra of $M_{\mathrm{diag}}$ and $M_{\mathrm{cross}}$, and will be proven in Section 5.2 and Section 5.3.

Proposition 5.4 (Spectral gap of diagonal terms). Let $R=\sqrt2\cdot((\mathbb E(xx^\top)^{\otimes2})^+)^{1/2}$ for $x\sim\mathcal N(0,\mathrm{Id}_d)$. Let $a_1,\dots,a_n$ be independent random vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$ with $d\le n\le d^{2-\Omega(1)}$ and let $g\sim\mathcal N(0,\mathrm{Id}_d)$ be independent of all the others. Let $T:=\sum_{i\in[n]}a_i(a_i\otimes a_i)^\top$. Suppose $M_{\mathrm{diag}}=\sum_{i\in[n]}\langle g,Ta_i^{\otimes2}\rangle\cdot(a_ia_i^\top)^{\otimes2}$.
Let also $v_j$ be such that $v_jv_j^\top=\langle g,Ta_j^{\otimes2}\rangle\cdot(a_ja_j^\top)^{\otimes2}$. Then, with probability $1-o(1)$ over $a_1,\dots,a_n$, for each $\varepsilon\ge\mathrm{polylog}(d)/\sqrt d$ and each $j\in[n]$, the event
$$E_{j,\varepsilon}\;\stackrel{\mathrm{def}}{=}\;\Bigl\{\bigl\|RM_{\mathrm{diag}}R-\varepsilon\cdot Rv_jv_j^\top R\bigr\|\;\le\;\bigl\|RM_{\mathrm{diag}}R\bigr\|-\bigl(\varepsilon-\tilde O(\sqrt n/d)\bigr)\cdot\bigl\|Rv_jv_j^\top R\bigr\|\Bigr\}$$
has probability at least $\tilde\Omega(1/n^{1+O(\varepsilon)})$ over the choice of $g$.

Second, we show that when $n\ll d^{4/3}$ the spectral norm of $M_{\mathrm{cross}}$ is negligible compared to this spectral gap.

Proposition 5.5 (Bound on cross terms). Let $a_1,\dots,a_n$ be independent random vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$, and let $g$ be a random vector from $\mathcal N(0,\mathrm{Id}_d)$. Let $T:=\sum_{i\in[n]}a_i(a_i\otimes a_i)^\top$. Let $M_{\mathrm{cross}}:=\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle\,a_ia_i^\top\otimes a_ja_j^\top$. Suppose $n\ge d$. Then w.ov.p.,
$$\|M_{\mathrm{cross}}\|\;\le\;\tilde O\Bigl(\frac{n^3}{d^4}\Bigr)^{1/2}\,.$$

Using these two propositions we will conclude that the top eigenvector of $RMR$ is likely to be correlated with one of the vectors $a_j^{\otimes2}$. We also need two simple concentration bounds; we defer the proofs to the appendix.

Lemma 5.6. Let $a_1,\dots,a_n$ be independently sampled vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$, and let $g$ be sampled from $\mathcal N(0,\mathrm{Id}_d)$. Let $T=\sum_ia_i(a_i\otimes a_i)^\top$. Then with overwhelming probability, for every $j\in[n]$,
$$\bigl|\langle g,T(a_j\otimes a_j)\rangle-\langle g,a_j\rangle\|a_j\|^4\bigr|\;\le\;\tilde O\Bigl(\frac{\sqrt n}{d}\Bigr)\,.$$

Fact 5.7 (Simple version of Fact C.1). Let $x,y\sim\mathcal N(0,\frac1d\mathrm{Id})$. With overwhelming probability, $\bigl|1-\|x\|^2\bigr|\le\tilde O(1/\sqrt d)$ and $\langle x,y\rangle^2=\tilde O(1/d)$.

As a last technical tool we will need a simple claim about the fourth moment matrix of the multivariate Gaussian.

Fact 5.8 (Simple version of Fact C.4). Let $\Sigma=\mathbb E_{x\sim\mathcal N(0,\mathrm{Id}_d)}(xx^\top)^{\otimes2}$ and let $R=\sqrt2\,(\Sigma^+)^{1/2}$. Then $\|R\|=1$, and for any $v\in\mathbb R^d$,
$$\|R(v\otimes v)\|_2^2\;=\;\Bigl(1-\frac1{d+2}\Bigr)\cdot\|v\|^4\,.$$

We are prepared to prove Theorem 5.3.

Proof of Theorem 5.3.
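Fact 5.8 can be checked exactly at small dimension: Wick's formula gives $\Sigma$ in closed form, and the identities $\|R\|=1$ and $\|R(v\otimes v)\|_2^2=(1-\frac1{d+2})\|v\|^4$ then follow numerically. A minimal sketch (the dimension $d=6$ is an arbitrary choice):

```python
import numpy as np

# Exact check of Fact 5.8 at a small (arbitrary) dimension. Wick's formula
# gives Sigma = E[(xx^T)^{(x)2}] in closed form:
# Sigma[(a,b),(c,e)] = d_ab d_ce + d_ac d_be + d_ae d_bc.
d = 6
I = np.eye(d)
Sigma = (np.einsum('ab,ce->abce', I, I)
         + np.einsum('ac,be->abce', I, I)
         + np.einsum('ae,bc->abce', I, I)).reshape(d*d, d*d)

w, U = np.linalg.eigh(Sigma)
inv_sqrt = np.where(w > 1e-9, 1/np.sqrt(np.maximum(w, 1e-9)), 0.0)
R = np.sqrt(2) * (U * inv_sqrt) @ U.T        # R = sqrt(2) * (Sigma^+)^{1/2}

v = np.random.default_rng(2).normal(size=d)
lhs = np.linalg.norm(R @ np.kron(v, v))**2
rhs = (1 - 1/(d + 2)) * np.linalg.norm(v)**4

print(np.isclose(np.linalg.norm(R, 2), 1.0))   # True: ||R|| = 1
print(np.isclose(lhs, rhs))                     # True: ||R(v(x)v)||^2 = (1 - 1/(d+2))||v||^4
```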
Let $d\le n\le d^{4/3}/\mathrm{polylog}(d)$ for some $\mathrm{polylog}(d)$ to be chosen later. Let $a_1,\dots,a_n$ be independent random vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$ and let $g\sim\mathcal N(0,\mathrm{Id}_d)$ be independent of the others. Let
$$M_{\mathrm{diag}}:=\sum_{i\in[n]}\langle g,T(a_i\otimes a_i)\rangle\cdot(a_ia_i^\top)^{\otimes2} \qquad\text{and}\qquad M_{\mathrm{cross}}:=\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top\,.$$
Note that $M=M_{\mathrm{diag}}+M_{\mathrm{cross}}$. Proposition 5.5 implies that
$$\mathbb P\bigl\{\|M_{\mathrm{cross}}\|\le\tilde O(n^{3/2}/d^2)\bigr\}\;\ge\;1-d^{-\omega(1)}\,.\qquad(5.1)$$
Recall that $\Sigma=\mathbb E_{x\sim\mathcal N(0,\mathrm{Id}_d)}(xx^\top)^{\otimes2}$ and $R=\sqrt2\cdot(\Sigma^+)^{1/2}$. By Proposition 5.4, with probability $1-o(1)$ over the choice of $a_1,\dots,a_n$, each of the following events $E^0_{j,\varepsilon}$ for $j\in[n]$ and $\varepsilon\ge\mathrm{polylog}(d)/\sqrt d$ has probability at least $\tilde\Omega(1/n^{1+O(\varepsilon)})$ over the choice of $g$:
$$E^0_{j,\varepsilon}:\quad \bigl\|R\bigl(M_{\mathrm{diag}}-\varepsilon\langle g,Ta_j^{\otimes2}\rangle(a_ja_j^\top)^{\otimes2}\bigr)R\bigr\|\;\le\;\|RM_{\mathrm{diag}}R\|-\bigl(\varepsilon-\tilde O(\sqrt n/d)\bigr)\cdot\bigl|\langle g,Ta_j^{\otimes2}\rangle\bigr|\cdot\bigl\|Ra_j^{\otimes2}\bigr\|^2\,.$$
Together with (5.1), with probability $1-o(1)$ over the choice of $a_1,\dots,a_n$, each of the following events $E^*_{j,\varepsilon}$ has probability at least $\tilde\Omega(1/n^{1+O(\varepsilon)})-d^{-\omega(1)}\ge\tilde\Omega(1/n^{1+O(\varepsilon)})$ over the choice of $g$:
$$E^*_{j,\varepsilon}:\quad \bigl\|R\bigl(M-\varepsilon\langle g,Ta_j^{\otimes2}\rangle(a_ja_j^\top)^{\otimes2}\bigr)R\bigr\| \;\le\; \|RM_{\mathrm{diag}}R\|-\bigl(\varepsilon-\tilde O(\sqrt n/d)\bigr)\cdot\bigl|\langle g,Ta_j^{\otimes2}\rangle\bigr|\cdot\bigl\|Ra_j^{\otimes2}\bigr\|^2+\tilde O(n^{3/2}/d^2) \;\le\; \|RMR\|-\bigl(\varepsilon-\tilde O(\sqrt n/d)\bigr)\cdot\bigl|\langle g,Ta_j^{\otimes2}\rangle\bigr|\cdot\bigl\|Ra_j^{\otimes2}\bigr\|^2+\tilde O(n^{3/2}/d^2)\,.$$
Here we used that $M=M_{\mathrm{diag}}+M_{\mathrm{cross}}$ and that $\|RM_{\mathrm{cross}}R\|\le\|M_{\mathrm{cross}}\|$, as $\|R\|\le1$ (Fact 5.8). By standard reasoning about the top eigenvector of a matrix with a spectral gap (recorded in Lemma A.3), the event $E^*_{j,\varepsilon}$ implies that the top eigenvector $u\in\mathbb R^{d^2}$ of $RMR$ satisfies
$$\Bigl\langle u,\frac{Ra_j^{\otimes2}}{\|Ra_j^{\otimes2}\|}\Bigr\rangle^2 \;\ge\; 1-\frac{\tilde O(\sqrt n/d)}{\varepsilon\,\|Ra_j^{\otimes2}\|^2}-\frac{\tilde O(n^{3/2}/d^2)}{\varepsilon\,\|Ra_j^{\otimes2}\|^2\,\bigl|\langle g,Ta_j^{\otimes2}\rangle\bigr|}\,.$$
Since $\|Ra_j^{\otimes2}\|^2\ge\Omega(\|a_j\|^4)$ (by Fact 5.8), and since $\|a_j\|\ge1-\tilde O(1/\sqrt d)$ (by Fact 5.7), this is
$$\ge\;1-\tilde O\Bigl(\frac{\sqrt n}{\varepsilon d}\Bigr)-\frac{\tilde O(n^{3/2}/d^2)}{\varepsilon\cdot\bigl|\langle g,Ta_j^{\otimes2}\rangle\bigr|}\,.$$
Now, by Lemma 5.6 we have that for all $j\in[n]$, $\bigl|\langle g,Ta_j^{\otimes2}\rangle-\langle g,a_j\rangle\|a_j\|^4\bigr|\le\tilde O(\sqrt n/d)$ with probability $1-n^{-\omega(1)}$. By standard concentration (see Fact C.1 for a proof), $|\langle g,a_j\rangle|\cdot\bigl|\|a_j\|^4-1\bigr|\le\tilde O(1/\sqrt d)$ for all $j\in[n]$ with probability $1-n^{-\omega(1)}$. Therefore with overwhelming probability, the final term is bounded by $\tilde O(n^{3/2}/\varepsilon d^2)$. A union bound now gives the desired conclusion.

Finally, we give a bound on the spectral gap. We note that the second eigenvector $w$ has $\langle u,w\rangle=0$, and therefore
$$\Bigl\langle w,\frac{Ra_j^{\otimes2}}{\|Ra_j^{\otimes2}\|}\Bigr\rangle \;=\; \Bigl\langle w,\frac{Ra_j^{\otimes2}}{\|Ra_j^{\otimes2}\|}-u\Bigr\rangle \;\le\; \Bigl\|\frac{Ra_j^{\otimes2}}{\|Ra_j^{\otimes2}\|}-u\Bigr\| \;\le\; \tilde O(n^{3/2}/\varepsilon d^2)\,.$$
Thus, using our above bound on $\|R(M-\varepsilon\langle g,Ta_j^{\otimes2}\rangle(a_ja_j^\top)^{\otimes2})R\|$ and the concentration bounds we have already applied for $\|a_j\|$, $\langle g,Ta_j^{\otimes2}\rangle$, and $\|Ra_j^{\otimes2}\|$, we have
$$\lambda_2(RMR)\;=\;w^\top RMRw\;=\;w^\top R\bigl(M-\varepsilon\langle g,Ta_j^{\otimes2}\rangle\cdot(a_ja_j^\top)^{\otimes2}\bigr)Rw+\varepsilon\langle g,Ta_j^{\otimes2}\rangle\bigl\langle w,Ra_j^{\otimes2}\bigr\rangle^2 \;\le\; \bigl\|R\bigl(M-\varepsilon\langle g,Ta_j^{\otimes2}\rangle\cdot(a_ja_j^\top)^{\otimes2}\bigr)R\bigr\|+\tilde O(n^{3/2}/\varepsilon d^2) \;\le\; 1-\Omega(\varepsilon)+\tilde O(n^{3/2}/\varepsilon d^2)\,.$$
We conclude that the above events also imply $\lambda_2(RMR)/\lambda_1(RMR)\le1-O(\varepsilon)$.

5.2 Spectral gap for diagonal terms: proof of Proposition 5.4

We now prove that the signal matrix, when preconditioned by $R$, has a noticeable spectral gap.

Proposition (Restatement of Proposition 5.4). Let $R=\sqrt2\cdot((\mathbb E(xx^\top)^{\otimes2})^+)^{1/2}$ for $x\sim\mathcal N(0,\mathrm{Id}_d)$. Let $a_1,\dots,a_n$ be independent random vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$ with $d\le n\le d^{2-\Omega(1)}$ and let $g\sim\mathcal N(0,\mathrm{Id}_d)$ be independent of all the others. Let $T:=\sum_{i\in[n]}a_i(a_i\otimes a_i)^\top$. Suppose $M_{\mathrm{diag}}=\sum_{i\in[n]}\langle g,Ta_i^{\otimes2}\rangle\cdot(a_ia_i^\top)^{\otimes2}$.
Let also $v_j$ be such that $v_jv_j^\top=\langle g,Ta_j^{\otimes2}\rangle\cdot(a_ja_j^\top)^{\otimes2}$. Then, with probability $1-o(1)$ over $a_1,\dots,a_n$, for each $\varepsilon\ge\mathrm{polylog}(d)/\sqrt d$ and each $j\in[n]$, the event
$$E_{j,\varepsilon}\;\stackrel{\mathrm{def}}{=}\;\Bigl\{\bigl\|RM_{\mathrm{diag}}R-\varepsilon\cdot Rv_jv_j^\top R\bigr\|\;\le\;\bigl\|RM_{\mathrm{diag}}R\bigr\|-\bigl(\varepsilon-\tilde O(\sqrt n/d)\bigr)\cdot\bigl\|Rv_jv_j^\top R\bigr\|\Bigr\}$$
has probability at least $\tilde\Omega(1/n^{1+O(\varepsilon)})$ over the choice of $g$.

The proof has two parts. First we show that for $a_1,\dots,a_n\sim\mathcal N(0,\frac1d\mathrm{Id}_d)$ the matrix $P:=\sum_{i\in[n]}(a_ia_i^\top)^{\otimes2}$ has tightly bounded spectral norm when preconditioned with $R$: more precisely, that $\|RPR\|\le1+\tilde O(n/d^{3/2})$.

Lemma 5.9. Let $a_1,\dots,a_n\sim\mathcal N(0,\frac1d\mathrm{Id}_d)$ be independent random vectors with $d\le n$. Let $R:=\sqrt2\cdot((\mathbb E(aa^\top)^{\otimes2})^+)^{1/2}$ for $a\sim\mathcal N(0,\mathrm{Id}_d)$. For $S\subseteq[n]$, let $P_S=\sum_{i\in S}(a_ia_i^\top)^{\otimes2}$ and let $\Pi_S$ be the projector onto the subspace spanned by $\{Ra_i^{\otimes2}\mid i\in S\}$. Then, with probability $1-o(1)$ over the choice of $a_1,\dots,a_n$,
$$\forall S\subseteq[n]\,.\quad \bigl(1-\tilde O(n/d^{3/2})\bigr)\cdot\Pi_S \;\preceq\; RP_SR \;\preceq\; \bigl(1+\tilde O(n/d^{3/2})\bigr)\cdot\Pi_S\,.$$

Remark 5.10. In [GM15, Lemma 5] a lemma similar to this one is proved in the context of the SoS proof system. However, since Ge and Ma leverage the full power of the SoS algorithm, their proof goes via a spectral bound on a different (but related) matrix. Since our algorithm avoids solving an SDP, we need a bound on this matrix in particular.

The proof of Lemma 5.9 proceeds by standard spectral concentration for tall matrices with independent columns (here the columns are $Ra_i^{\otimes2}$). The arc of the proof is straightforward but involves some bookkeeping; we have deferred it to Section C.0.4.

We also need the following lemma on the concentration of some scalar random variables involving $R$; the proof is straightforward by finding the eigenbasis of $R$ and applying standard concentration, and it is deferred to the appendix.

Lemma 5.11. Let $a_1,\ldots$
$,a_n\sim\mathcal N(0,\frac1d\mathrm{Id}_d)$. Let $\Sigma,R$ be as in Fact 5.8. Let $u_i=a_i\otimes a_i$. With overwhelming probability, every $j\in[n]$ satisfies $\sum_{i\ne j}\langle u_j,R^2u_i\rangle^2=\tilde O(n/d^2)$ and $\bigl|1-\|Ru_j\|^2\bigr|\le\tilde O(1/\sqrt d)$.

The next lemma is the linchpin of the proof of Proposition 5.4: one of the inner products $\langle g,Ta_j^{\otimes2}\rangle$ is likely to be a $\approx(1+1/\log n)$-factor larger than the maximum of the inner products $\langle g,Ta_i^{\otimes2}\rangle$ over $i\ne j$. Together with standard linear algebra, these facts imply that the matrix $M_{\mathrm{diag}}=\sum_{i\in[n]}\langle g,Ta_i^{\otimes2}\rangle(a_ia_i^\top)^{\otimes2}$ has top eigenvector highly correlated or anticorrelated with some $a_i$.

Lemma 5.12. Let $a_1,\dots,a_n\in\mathbb R^d$ be independent random vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$, and let $g$ be a random vector from $\mathcal N(0,\mathrm{Id}_d)$. Let $T=\sum_{i\in[n]}a_i(a_i\otimes a_i)^\top$. Let $\varepsilon>0$ and $j\in[n]$. Then with overwhelming probability over $a_1,\dots,a_n$, the following event $\hat E_{j,\varepsilon}$ has probability $1/n^{1+O(\varepsilon)+\tilde O(1/\sqrt d)}$ over the choice of $g$:
$$\hat E_{j,\varepsilon}\;=\;\Bigl\{\langle g,Ta_j^{\otimes2}\rangle\;\ge\;(1+\varepsilon)\bigl(1-\tilde O(1/\sqrt d)\bigr)\cdot\max_{i\ne j}\langle g,Ta_i^{\otimes2}\rangle\Bigr\}\,.$$

Now we can prove Proposition 5.4.

Proof of Proposition 5.4. Let $u_i:=a_i^{\otimes2}$ and fix $j\in[n]$. We begin by showing a lower bound on the spectral norm $\|RM_{\mathrm{diag}}R\|$:
$$\|RM_{\mathrm{diag}}R\|\;=\;\max_{\|v\|=1}\langle v,RM_{\mathrm{diag}}Rv\rangle\;\ge\;\frac{\langle Ru_j,(RM_{\mathrm{diag}}R)\,Ru_j\rangle}{\|Ru_j\|^2}\;=\;\frac1{\|Ru_j\|^2}\Bigl(\langle g,Tu_j\rangle\|Ru_j\|^4+\Bigl\langle Ru_j,\sum_{i\ne j}\langle g,Tu_i\rangle\,Ru_iu_i^\top R\cdot Ru_j\Bigr\rangle\Bigr)\,.$$
From Lemma 5.12, the random vector $g$ is closer to $Tu_j$ than to all $Tu_i$ for $i\ne j$, $i\in[n]$, with reasonable probability. More concretely, there is some $\mathrm{polylog}(d)$ so that as long as $\varepsilon\ge\mathrm{polylog}(d)/\sqrt d$, there is some $\alpha=\Theta(\varepsilon)$ with $1-\varepsilon=1/[(1+\alpha)(1-\tilde O(d^{-1/2}))]$ so that, w.ov.p. over $a_1,\ldots$
$,a_n$, the following event (a direct consequence of $\hat E_{j,\alpha}$) has probability $\tilde\Omega(1/n^{1+O(\alpha)+\tilde O(d^{-1/2})})=\tilde\Omega(1/n^{1+O(\varepsilon)})$ over $g$:
$$-(1-\varepsilon)\bigl|\langle g,Tu_j\rangle\bigr|\cdot\sum_{i\ne j}Ru_iu_i^\top R \;\preceq\; \sum_{i\ne j}\langle g,Tu_i\rangle\cdot Ru_iu_i^\top R \;\preceq\; (1-\varepsilon)\bigl|\langle g,Tu_j\rangle\bigr|\cdot\sum_{i\ne j}Ru_iu_i^\top R\,.\qquad(5.2)$$
When (5.2) occurs,
$$\|RM_{\mathrm{diag}}R\| \;\ge\; \frac1{\|Ru_j\|^2}\Bigl(|\langle g,Tu_j\rangle|\,\|Ru_j\|^4-(1-\varepsilon)\,|\langle g,Tu_j\rangle|\,\Bigl\langle Ru_j,\sum_{i\ne j}Ru_iu_i^\top R\cdot Ru_j\Bigr\rangle\Bigr) \;=\; \frac{|\langle g,Tu_j\rangle|}{\|Ru_j\|^2}\Bigl(\|Ru_j\|^4-(1-\varepsilon)\sum_{i\ne j}\langle u_j,R^2u_i\rangle^2\Bigr) \;\ge\; |\langle g,Tu_j\rangle|\cdot\frac{1-\tilde O(1/\sqrt d)-(1-\varepsilon)\,\tilde O(n/d^2)}{1+\tilde O(1/\sqrt d)} \quad\text{w.ov.p. over }a_1,\dots,a_n\ (\text{Lemma 5.11}) \;\ge\; |\langle g,Tu_j\rangle|\cdot(1-\eta_{\mathrm{norm}})\,,\qquad(5.3)$$
where we have chosen some $0\le\eta_{\mathrm{norm}}\le\tilde O(1/\sqrt d)+\tilde O(n/d^2)$ (since for any $x\in\mathbb R$, $(1+x)(1-x)\le1$).

Next we exhibit an upper bound on $\|RM_{\mathrm{diag}}R-\varepsilon\langle g,Tu_j\rangle Ru_ju_j^\top R\|$. Again when (5.2) occurs,
$$\bigl\|RM_{\mathrm{diag}}R-\varepsilon\langle g,Tu_j\rangle Ru_ju_j^\top R\bigr\| \;=\; \Bigl\|(1-\varepsilon)\langle g,Tu_j\rangle Ru_ju_j^\top R+\sum_{i\ne j}\langle g,Tu_i\rangle Ru_iu_i^\top R\Bigr\|\qquad(5.4)$$
$$\le\; (1-\varepsilon)\,|\langle g,Tu_j\rangle|\,\Bigl\|\sum_{i\in[n]}Ru_iu_i^\top R\Bigr\| \quad\text{when (5.2) occurs} \;\le\; (1-\varepsilon)\,|\langle g,Tu_j\rangle|\,\bigl(1+\tilde O(n/d^{1.5})\bigr) \quad\text{w.p. }1-o(1)\text{ over }a_1,\dots,a_n\text{ by Lemma 5.9} \;\le\; (1-\varepsilon)\,|\langle g,Tu_j\rangle|\,(1+\eta_{\mathrm{gap}})\,,\qquad(5.5)$$
where we have chosen some $0\le\eta_{\mathrm{gap}}\le\tilde O(n/d^{1.5})$.

Putting together (5.3) and (5.5) with our bounds on $\eta_{\mathrm{norm}}$ and $\eta_{\mathrm{gap}}$, and recalling the conditions on (5.2), we have shown that
$$\mathbb P_{a_1,\dots,a_n}\Bigl\{\mathbb P_g\bigl\{\|RM_{\mathrm{diag}}R-\varepsilon\langle g,Tu_j\rangle Ru_ju_j^\top R\|\;\le\;\|RM_{\mathrm{diag}}R\|-(\varepsilon-\tilde O(\sqrt n/d))\cdot|\langle g,Tu_j\rangle|\cdot\|Ru_j\|^2\bigr\}\;\ge\;\tilde\Omega(1/n^{1+O(\varepsilon)})\Bigr\}\;\ge\;1-o(1)\,.$$
This concludes the argument.

We now turn to proving that with reasonable probability, $g$ is closer to some $Ta_j^{\otimes2}$ than to all others.

Proof of Lemma 5.12. To avoid a proliferation of indices, without loss of generality fix $j=1$.
We begin by expanding the expression $\langle g,Ta_i^{\otimes2}\rangle$:
$$\langle g,Ta_i^{\otimes2}\rangle\;=\;\sum_{\ell\in[n]}\langle g,a_\ell\rangle\langle a_\ell,a_i\rangle^2\;=\;\|a_i\|^4\langle g,a_i\rangle+\sum_{\ell\ne i}\langle g,a_\ell\rangle\langle a_\ell,a_i\rangle^2\,.$$
The latter sum is bounded by
$$\Bigl|\sum_{\ell\ne i}\langle g,a_\ell\rangle\langle a_\ell,a_i\rangle^2\Bigr|\;\le\;\tilde O\Bigl(\frac{\sqrt n}{d}\Bigr)$$
with overwhelming probability for all $i$ (over the choice of the $a_\ell$ and $g$); this follows from a Bernstein bound, given in Lemma 5.6.

For ease of notation, let $\hat a_i\stackrel{\mathrm{def}}{=}a_i/\|a_i\|_2$. We conclude from Fact 5.7 that with overwhelming probability, $1-\tilde O(1/\sqrt d)\le\|a_i\|_2\le1+\tilde O(1/\sqrt d)$ for all $i\in[n]$. Thus $\|a_i\|_2$ is roughly equal for all $i$, and we may direct our attention to $\langle g,\hat a_i\rangle$.

Let $\mathcal G_1$ be the event that $\sqrt{2\alpha}\log^{1/2}n\le|\langle g,\hat a_1\rangle|\le d^{1/4}$, for some $\alpha\le d^{1/2-\Omega(1)}$ to be chosen later. We note that $\langle g,\hat a_1\rangle$ is distributed as a standard Gaussian, and that $g$ is independent of $a_1,\dots,a_n$. Thus we can use standard tail estimates for univariate Gaussians (Lemma A.4) to conclude that
$$\mathbb P\Bigl(|\langle g,\hat a_1\rangle|\ge\sqrt{2\alpha}\log^{1/2}n\Bigr)\;=\;\tilde\Theta(n^{-\alpha}) \qquad\text{and}\qquad \mathbb P\Bigl(|\langle g,\hat a_1\rangle|\ge d^{1/4}\Bigr)\;=\;\Theta\Bigl(\frac{\exp(-\sqrt d/2)}{d^{1/4}}\Bigr)\,.$$
So by a union bound, $\mathbb P(\mathcal G_1)\ge\tilde\Omega(n^{-\alpha})-O(e^{-d^{1/2}/3})=\tilde\Omega(n^{-\alpha})$.

Now we must estimate the probability that all other inner products with $g$ are small. Let $\mathcal G_{i>1}$ be the event that $|\langle g,\hat a_i\rangle|\le\sqrt{2+\rho}\,\log^{1/2}n$ for all $i\in[n]$ with $i>1$, for some $\rho$ to be chosen later. We will show that conditioned on $\mathcal G_1$, the event $\mathcal G_{i>1}$ occurs with probability $1-O(n^{1-(2+\rho)/2})$.

Define $g^\parallel:=\langle g,\hat a_1\rangle\hat a_1$ to be the component of $g$ parallel to $\hat a_1$, let $g^\perp:=g-g^\parallel$ be the component of $g$ orthogonal to $\hat a_1$, and similarly let $\hat a_2^\perp,\dots,\hat a_n^\perp$ be the components of $\hat a_2,\dots,\hat a_n$ orthogonal to $\hat a_1$.
Because $g^\perp$ is independent of $g^\parallel$, even conditioned on $\mathcal G_1$ we may apply the standard tail bound for univariate Gaussians (Lemma A.4), concluding that for all $i>1$,
$$\mathbb P\Bigl(|\langle g^\perp,\hat a_i^\perp\rangle|\ge\sqrt{2+\rho}\,\log^{1/2}n \;\Big|\; \mathcal G_1\Bigr)\;=\;\tilde\Theta\bigl(n^{-(2+\rho)/2}\bigr)\,.$$
Thus a union bound over $i\ne1$ allows us to conclude that conditioned on $\mathcal G_1$, with probability $1-\tilde O(n^{-\rho/2})$ every $i\in[n]$ with $i>1$ has $|\langle g^\perp,\hat a_i^\perp\rangle|\le\sqrt{2+\rho}\,\log^{1/2}n$.

On the other hand, let $\hat a_2^\parallel,\dots,\hat a_n^\parallel$ be the components of the $\hat a_i$ parallel to $\hat a_1$. We compute the projection of $\hat a_i$ onto $\hat a_1$: with overwhelming probability,
$$\langle\hat a_1,\hat a_i\rangle\;=\;\frac{\langle a_1,a_i\rangle}{\|a_1\|_2\cdot\|a_i\|_2}\;=\;\bigl(1\pm\tilde O(1/\sqrt d)\bigr)\cdot\langle a_1,a_i\rangle \quad\text{(w.ov.p., by }\|a_i\|_2,\|a_1\|_2=1\pm\tilde O(1/\sqrt d)\text{, Fact 5.7)} \;=\;\bigl(1\pm\tilde O(1/\sqrt d)\bigr)\cdot\tilde O(1/\sqrt d) \quad\text{(w.ov.p., by }\langle a_1,a_i\rangle=\tilde O(1/\sqrt d)\text{, Fact 5.7)}\,.$$
Thus w.ov.p.,
$$|\langle g^\parallel,\hat a_i^\parallel\rangle|\;=\;|\langle g,\hat a_1\rangle|\cdot|\langle\hat a_1,\hat a_i\rangle|\;\le\;|\langle g,\hat a_1\rangle|\cdot\tilde O(1/\sqrt d)$$
for all $i\in[n]$.

Now we can analyze $\mathcal G_{i>1}$. Taking a union bound over the overwhelmingly probable events (including $\|a_i\|\le1+\tilde O(1/\sqrt d)$) and the event that $\langle g^\perp,\hat a_i^\perp\rangle$ is small for all $i$, we have that with probability $1-\tilde O(n^{-\rho/2})$, for every $i\in[n]$ with $i>1$,
$$|\langle g,\hat a_i\rangle|\;\le\;|\langle g^\perp,\hat a_i\rangle|+|\langle g^\parallel,\hat a_i\rangle|\;\le\;\sqrt{2+\rho}\,\log^{1/2}n+\tilde O(1/\sqrt d)\cdot|\langle g,\hat a_1\rangle|\;\le\;\sqrt{2+\rho}\,\log^{1/2}n+\tilde O(1/d^{1/4})\,,$$
using in the last step that $|\langle g,\hat a_1\rangle|\le d^{1/4}$ on $\mathcal G_1$. We conclude that
$$\mathbb P(\mathcal G_1,\mathcal G_{i>1})\;=\;\mathbb P(\mathcal G_{i>1}\mid\mathcal G_1)\cdot\mathbb P(\mathcal G_1)\;\ge\;\bigl(1-O(n^{-\rho/2})\bigr)\cdot\tilde\Omega(n^{-\alpha})\,.$$
Setting $\rho=\frac{2\log\log n}{\log n}$ and $\alpha=(1+\varepsilon)^2\bigl(1+\log\log n/\log n+\tilde O(1/\sqrt d)\bigr)$, the conclusion follows.

5.3 Bound for cross terms: proof of Proposition 5.5

We proceed to the bound on the cross terms $M_{\mathrm{cross}}$.

Proposition (Restatement of Proposition 5.5). Let $a_1,\dots,a_n$ be independent random vectors from $\mathcal N(0,\frac1d\mathrm{Id}_d)$, and let $g$ be a random vector from $\mathcal N(0,\mathrm{Id}_d)$.
Let $T:=\sum_{i\in[n]}a_i(a_i\otimes a_i)^\top$. Let $M_{\mathrm{cross}}:=\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle\,a_ia_i^\top\otimes a_ja_j^\top$. Suppose $n\ge d$. Then w.ov.p.,
$$\|M_{\mathrm{cross}}\|\;\le\;\tilde O\Bigl(\frac{n^3}{d^4}\Bigr)^{1/2}\,.$$

The proof will use two iterations of matrix Rademacher bounds. The first step will be to employ a classical decoupling inequality that has previously been used in a tensor decomposition context [GM15].

Theorem 5.13 (Special case of Theorem 1 in [dlPMS95]). Let $\{s_i\},\{t_i\}$ be independent iid sequences of random signs. Let $\{M_{ij}\}$ be a family of matrices. There is a universal constant $C$ so that for every $t>0$,
$$\mathbb P\Bigl\{\Bigl\|\sum_{i\ne j}s_is_jM_{ij}\Bigr\|_{\mathrm{op}}>t\Bigr\}\;\le\;C\cdot\mathbb P\Bigl\{C\,\Bigl\|\sum_{i\ne j}s_it_jM_{ij}\Bigr\|_{\mathrm{op}}>t\Bigr\}\,.$$

Once the simplified cross terms are decoupled, we can use a matrix Rademacher bound on one set of signs.

Theorem 5.14 (Adapted from Theorem 4.1.1 in [Tro12]⁹). Consider a finite sequence $\{M_i\}$ of fixed $m\times m$ Hermitian matrices. Let $s_i$ be a sequence of independent sign variables. Let $\sigma^2:=\|\sum_iM_i^2\|$. Then for every $t>0$,
$$\mathbb P\Bigl\{\Bigl\|\sum_is_iM_i\Bigr\|_{\mathrm{op}}\ge t\Bigr\}\;\le\;2m\cdot e^{-t^2/2\sigma^2}\,.$$
Also,
$$\mathbb E\,\Bigl\|\sum_is_iM_i\Bigr\|\;\le\;\sqrt{8\sigma^2\log d}\,.$$

⁹ We remark that Tropp's bound is phrased in terms of $\lambda_{\max}(\sum_is_iM_i)$. Since $\lambda_{\max}(\sum_is_iM_i)=-\lambda_{\min}(\sum_i(-s_i)M_i)$, and the distribution of $s_iM_i$ is negation-invariant, the result we state here follows from an easy union bound.

Corollary 5.15. Let $s_1,\dots,s_n$ be independent signs in $\{-1,1\}$. Let $A_1,\dots,A_n$ and $B_1,\dots,B_n$ be Hermitian matrices. Then w.ov.p.,
$$\Bigl\|\sum_is_i\cdot A_i\otimes B_i\Bigr\|\;\le\;\tilde O\Bigl(\max_i\|B_i\|\cdot\Bigl\|\sum_iA_i^2\Bigr\|^{1/2}\Bigr)\,.$$

Proof. We use a matrix Rademacher bound and standard manipulations: w.ov.p.,
$$\Bigl\|\sum_is_i\cdot A_i\otimes B_i\Bigr\| \;\le\; \tilde O\Bigl(\Bigl\|\sum_iA_i^2\otimes B_i^2\Bigr\|^{1/2}\Bigr) \;\le\; \tilde O\Bigl(\Bigl\|\sum_i\|B_i\|^2\cdot(A_i^2\otimes\mathrm{Id})\Bigr\|^{1/2}\Bigr) \quad\text{since }A_i^2\text{ is PSD for all }i \;\le\; \tilde O\Bigl(\max_i\|B_i\|^2\cdot\Bigl\|\sum_iA_i^2\Bigr\|\Bigr)^{1/2} \quad\text{since }A_i^2\otimes\mathrm{Id}\text{ is PSD for all }i\,.$$

We also need a few further concentration bounds on matrices which will come up as parts of $M_{\mathrm{cross}}$.
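The bound in Corollary 5.15 can be observed numerically. The following sketch uses arbitrary small sizes, with the constant 10 standing in for the hidden logarithmic factors:

```python
import numpy as np

# Numerical illustration of Corollary 5.15: ||sum_i s_i A_i (x) B_i|| is,
# up to log factors, at most max_i ||B_i|| * ||sum_i A_i^2||^{1/2}.
# Sizes and the slack constant 10 are arbitrary test choices.
rng = np.random.default_rng(4)
n, p, q = 30, 8, 8

def rand_sym(k):
    X = rng.normal(size=(k, k))
    return (X + X.T) / 2

As = [rand_sym(p) for _ in range(n)]
Bs = [rand_sym(q) for _ in range(n)]
s = rng.choice([-1.0, 1.0], size=n)

lhs = np.linalg.norm(sum(si * np.kron(Ai, Bi) for si, Ai, Bi in zip(s, As, Bs)), 2)
rhs = max(np.linalg.norm(Bi, 2) for Bi in Bs) * np.linalg.norm(sum(Ai @ Ai for Ai in As), 2) ** 0.5
print(lhs <= 10 * rhs)   # holds with overwhelming probability
```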
These can be proved by standard inequalities for sums of independent matrices.

Lemma 5.16 (Restatement of Fact C.2 and Lemma C.3). Let $a_1,\dots,a_n$ be independent from $\mathcal N(0,\frac1d\mathrm{Id}_d)$ with $n\ge d\,\mathrm{polylog}(d)$. With overwhelming probability,
$$\tilde\Omega(n/d)\cdot\mathrm{Id}\;\preceq\;\sum_{i\in[n]}a_ia_i^\top\;\preceq\;\tilde O(n/d)\cdot\mathrm{Id}\,.$$
Additionally, if $g\sim\mathcal N(0,\mathrm{Id}_d)$ is independent of the rest, then for every $j\in[n]$, w.ov.p.,
$$\Bigl\|\sum_{i\in[n],\,i\ne j}\langle g,a_i\rangle\|a_i\|^2\langle a_i,a_j\rangle\cdot a_ia_i^\top\Bigr\|\;\le\;\tilde O(n/d^2)^{1/2}\,.$$

Proof of Proposition 5.5. We expand $M_{\mathrm{cross}}$:
$$M_{\mathrm{cross}}\;=\;\sum_{i\ne j}\langle g,T(a_i\otimes a_j)\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top\;=\;\sum_{i\ne j}\sum_{\ell\in[n]}\langle a_\ell,a_i\rangle\langle a_\ell,a_j\rangle\langle g,a_\ell\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top\,.$$
Since the joint distribution of $(a_1,\dots,a_n)$ is identical to that of $(s_1a_1,\dots,s_na_n)$, this is distributed identically to
$$M'_{\mathrm{cross}}\;=\;\sum_{\ell\in[n]}\sum_{i\ne j}s_is_js_\ell\,\langle g,a_\ell\rangle\langle a_\ell,a_i\rangle\langle a_\ell,a_j\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top$$
(where we have also swapped the sums over $\ell$ and $i\ne j$). We split $M'_{\mathrm{cross}}$ into $M_{\mathrm{diff}}$, for which $i\ne\ell$ and $j\ne\ell$, and $M_{\mathrm{same}}$, for which $\ell=i$ or $\ell=j$, and bound the norm of each of these sums separately.

We begin with $M_{\mathrm{same}}$:
$$M_{\mathrm{same}}\;\stackrel{\mathrm{def}}{=}\;\sum_{i\ne j}s_i^2s_j\,\langle g,a_i\rangle\langle a_i,a_i\rangle\langle a_i,a_j\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top+\sum_{i\ne j}s_j^2s_i\,\langle g,a_j\rangle\langle a_j,a_j\rangle\langle a_i,a_j\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top\,.$$
By a union bound and an application of the triangle inequality, it will be enough to show that just one of these two sums is $\tilde O(n^3/d^4)^{1/2}$ w.ov.p. We rewrite the left-hand one:
$$\sum_{i\ne j}s_i^2s_j\,\langle g,a_i\rangle\|a_i\|^2\langle a_i,a_j\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top\;=\;\sum_{j\in[n]}s_j\,a_ja_j^\top\otimes\sum_{i\ne j}\langle g,a_i\rangle\|a_i\|^2\langle a_i,a_j\rangle\cdot a_ia_i^\top\,,$$
where we may freely exchange the order of the tensor factors, since this does not affect the spectral norm. Define
$$M_j\;\stackrel{\mathrm{def}}{=}\;\sum_{i\ne j}\langle g,a_i\rangle\|a_i\|^2\langle a_i,a_j\rangle\cdot a_ia_i^\top\,,$$
so that we now need to bound $\|\sum_{j\in[n]}s_j\,a_ja_j^\top\otimes M_j\|$. By Corollary 5.15, w.ov.p.,
$$\Bigl\|\sum_{j\in[n]}s_j\,a_ja_j^\top\otimes M_j\Bigr\| \;\le\; \tilde O\bigl(\max_j\|M_j\|\bigr)\cdot\tilde O\Bigl(\Bigl\|\sum_{j\in[n]}\|a_j\|^2\,a_ja_j^\top\Bigr\|^{1/2}\Bigr) \;\le\; \tilde O\bigl(\max_j\|M_j\|\bigr)\cdot\max_j\|a_j\|\cdot\tilde O\Bigl(\Bigl\|\sum_{j\in[n]}a_ja_j^\top\Bigr\|^{1/2}\Bigr)\,.$$
In Lemma 5.16 we bound $\max_j\|M_j\|\le\tilde O(n/d^2)^{1/2}$ w.ov.p. using a matrix Bernstein inequality. Combining this bound with the concentration of $\|a_j\|$ around 1 (Fact 5.7), we obtain w.ov.p.
$$\le\;\tilde O(n/d^2)^{1/2}\cdot\tilde O(n/d)^{1/2}\;=\;\tilde O(n/d^{1.5})\,.$$
Having finished with $M_{\mathrm{same}}$, we turn to $M_{\mathrm{diff}}$:
$$\|M_{\mathrm{diff}}\|\;=\;\Bigl\|\sum_{\ell\ne i\ne j}s_is_js_\ell\,\langle g,a_\ell\rangle\langle a_\ell,a_i\rangle\langle a_\ell,a_j\rangle\cdot a_ia_i^\top\otimes a_ja_j^\top\Bigr\|\;=\;\Bigl\|\sum_\ell s_\ell\langle g,a_\ell\rangle\Bigl(\sum_{i\ne\ell}s_i\langle a_\ell,a_i\rangle\,a_ia_i^\top\Bigr)\otimes\Bigl(\sum_{j\ne\ell,i}s_j\langle a_\ell,a_j\rangle\,a_ja_j^\top\Bigr)\Bigr\|\,.$$
Letting $t_1,\dots,t_n$ and $r_1,\dots,r_n$ be independent uniformly random signs, by Theorem 5.13 it will be enough to bound the spectral norm after replacing the second and third occurrences of $s_i$ with $t_i$ and $r_i$. To this end, we define
$$M'_{\mathrm{diff}}\;\stackrel{\mathrm{def}}{=}\;\sum_\ell s_\ell\langle g,a_\ell\rangle\Bigl(\sum_{i\ne\ell}t_i\langle a_\ell,a_i\rangle\,a_ia_i^\top\Bigr)\otimes\Bigl(\sum_{j\ne\ell,i}r_j\langle a_\ell,a_j\rangle\,a_ja_j^\top\Bigr)\,.$$
Let
$$N_\ell\;\stackrel{\mathrm{def}}{=}\;\Bigl(\sum_{i\ne\ell}t_i\langle a_\ell,a_i\rangle\,a_ia_i^\top\Bigr)\otimes\Bigl(\sum_{j\ne\ell,i}r_j\langle a_\ell,a_j\rangle\,a_ja_j^\top\Bigr)\,,$$
so that we are to bound $\|\sum_{\ell\in[n]}s_\ell\langle g,a_\ell\rangle\cdot N_\ell\|$. By a matrix Rademacher bound and elementary manipulations, w.ov.p.,
$$\Bigl\|\sum_{\ell\in[n]}s_\ell\langle g,a_\ell\rangle\cdot N_\ell\Bigr\|\;\le\;\tilde O\Bigl(\Bigl\|\sum_{\ell\in[n]}\langle g,a_\ell\rangle^2\cdot N_\ell^2\Bigr\|^{1/2}\Bigr)\;\le\;\tilde O(\sqrt n)\cdot\max_{\ell\in[n]}|\langle g,a_\ell\rangle|\cdot\max_{\ell\in[n]}\|N_\ell\|\;\le\;\tilde O(\sqrt n)\cdot\max_{\ell\in[n]}\|N_\ell\|\,,$$
since $|\langle g,a_\ell\rangle|\le\tilde O(1)$ w.ov.p. (Fact 5.7).

The rest of the proof is devoted to bounding $\|N_\ell\|$. We start with Corollary 5.15 to get, w.ov.p.,
$$\|N_\ell\|\;\le\;\tilde O\Bigl(\max_i\Bigl\|\sum_{j\ne\ell,i}r_j\langle a_\ell,a_j\rangle\cdot a_ja_j^\top\Bigr\|\cdot\Bigl\|\sum_{i\ne\ell}\langle a_\ell,a_i\rangle^2\|a_i\|^2\cdot a_ia_i^\top\Bigr\|^{1/2}\Bigr)\,.$$
We use a matrix Rademacher bound for the left-hand matrix: w.ov.p.,
$$\Bigl\|\sum_{j\ne\ell,i}r_j\langle a_\ell,a_j\rangle\cdot a_ja_j^\top\Bigr\|\;\le\;\tilde O\Bigl(\Bigl\|\sum_{j\ne\ell,i}\langle a_\ell,a_j\rangle^2\|a_j\|^2\cdot a_ja_j^\top\Bigr\|^{1/2}\Bigr)\;\le\;\tilde O\Bigl(\max_{j\ne\ell}\langle a_\ell,a_j\rangle^2\|a_j\|^2\cdot\Bigl\|\sum_ja_ja_j^\top\Bigr\|\Bigr)^{1/2}\;\le\;\tilde O\Bigl(\frac{\sqrt n}{d}\Bigr)\,,$$
where we have used that $\langle a_\ell,a_j\rangle^2$ concentrates around $\frac1d$ (Fact 5.7), that $\|a_j\|^2$ concentrates around 1 (Fact 5.7), and that $\sum_ia_ia_i^\top$ concentrates around $\frac nd\cdot\mathrm{Id}$ (Lemma 5.16), all within logarithmic factors and all with overwhelming probability. For the right-hand matrix, we use the fact that the summands are PSD to conclude that, w.ov.p.,
$$\Bigl\|\sum_{i\ne\ell}\langle a_\ell,a_i\rangle^2\|a_i\|^2\cdot a_ia_i^\top\Bigr\|\;\le\;\max_{i\ne\ell}\langle a_\ell,a_i\rangle^2\|a_i\|^2\cdot\Bigl\|\sum_{i\ne\ell}a_ia_i^\top\Bigr\|\;\le\;\tilde O(1/d)\cdot\tilde O(n/d)\,,$$
using the same concentration facts as earlier. Putting these together, w.ov.p.,
$$\|N_\ell\|\;\le\;\tilde O(\sqrt n/d)\cdot\tilde O(\sqrt n/d)\;=\;\tilde O(n/d^2)\,.$$
Now we are ready to make the final bound on $M'_{\mathrm{diff}}$. With overwhelming probability,
$$\|M'_{\mathrm{diff}}\|\;\le\;\tilde O(\sqrt n)\cdot\max_{\ell\in[n]}\|N_\ell\|\;\le\;\tilde O(n^3/d^4)^{1/2}\,,$$
and hence by Theorem 5.13, $\|M_{\mathrm{diff}}\|\le\tilde O(n^3/d^4)^{1/2}$ w.ov.p. Finally, by the triangle inequality and all our bounds thus far, w.ov.p.,
$$\|M_{\mathrm{cross}}\|\;\le\;\|M_{\mathrm{same}}\|+\|M_{\mathrm{diff}}\|\;\le\;\tilde O(n/d^{1.5})+\tilde O(n^3/d^4)^{1/2}\;\le\;\tilde O(n^3/d^4)^{1/2}\,,$$
where the last inequality uses $n\ge d$.

5.4 Full algorithm and proof of Theorem 5.2

In this subsection we give the full details of our tensor decomposition algorithm. As discussed above, the algorithm proceeds by constructing a random matrix from the input tensor, then computing and post-processing its top eigenvector.

Spectral Tensor Decomposition (One Attempt). This is the main subroutine of our algorithm; we will run it $\tilde O(n)$ times and show that this recovers all of the components $a_1,\dots,a_n$.

Algorithm 5.17.
Input: $\mathbf T=\sum_{i=1}^na_i\otimes a_i\otimes a_i$.
Goal: Recover $a_i$ for some $i\in[n]$.
• Compute the matrix unfolding $T\in\mathbb R^{d\times d^2}$ of $\mathbf T$. Then compute a 3-tensor $\mathbf S\in\mathbb R^{d^2\times d^2\times d^2}$ by starting with the 6-tensor $\mathbf T\otimes\mathbf T$, permuting indices, and flattening to a 3-tensor.
• Apply $T$ in one mode of $S$ to obtain a 3-tensor $\mathcal M\in\mathbb R^{d\times d^2\times d^2}$, so that

$$T = \sum_{i\in[n]} a_i (a_i\otimes a_i)^\top, \qquad S = \sum_{i,j=1}^{n} (a_i\otimes a_j)^{\otimes 3}, \qquad \mathcal M = S(T, \mathrm{Id}_{d^2}, \mathrm{Id}_{d^2}) = \sum_{i,j\in[n]} T(a_i\otimes a_j)\otimes (a_i\otimes a_j)\otimes (a_i\otimes a_j).$$

• Sample a vector $g\in\mathbb R^d$ with iid standard gaussian entries. Evaluate $\mathcal M$ in its first mode in the direction of $g$ to obtain $M\in\mathbb R^{d^2\times d^2}$:

$$M := \mathcal M(g, \mathrm{Id}_{d^2}, \mathrm{Id}_{d^2}) = \sum_{i,j\in[n]} \langle g, T(a_i\otimes a_j)\rangle \cdot (a_i\otimes a_j)(a_i\otimes a_j)^\top.$$

• Let $\Sigma \stackrel{\mathrm{def}}{=} \mathbb E[(aa^\top)^{\otimes 2}]$ for $a\sim\mathcal N(0,\mathrm{Id}_d)$. Let $R \stackrel{\mathrm{def}}{=} \sqrt2\cdot(\Sigma^+)^{1/2}$. Compute the top eigenvector $u\in\mathbb R^{d^2}$ of $RMR$, and reshape $Ru$ to a matrix $U\in\mathbb R^{d\times d}$.

• For each of the signings $\pm u_1, \pm u_2$ of the top 2 unit left (or right) singular vectors of $U$, check whether $\sum_{i\in[n]} \langle a_i, \pm u_j\rangle^3 \ge 1 - c(n,d)$, where $c(n,d) = \Theta(n/d^{3/2})$ is an appropriate threshold. If so, output $\pm u_j$. Otherwise output nothing.

Theorem 5.3 gets us most of the way to the correctness of Algorithm 5.17, proving that the top eigenvector of the matrix $RMR$ is correlated with some $a_i^{\otimes 2}$ with reasonable probability. We need a few more ingredients to prove Theorem 5.2. First, we need to show a bound on the running time of Algorithm 5.17.

Lemma 5.18. Algorithm 5.17 can be implemented in time $\tilde O(d^{1+\omega})$, where $d^\omega$ is the runtime for multiplying two $d\times d$ matrices. It may also be implemented in time $\tilde O(d^{3.257})$.

Proof. To run the algorithm, we only require access to power iteration using the matrix $RMR$. We first give a fast implementation for power iteration with the matrix $M$, and handle the multiplications with $R$ separately. Consider a vector $v\in\mathbb R^{d^2}$ and a random vector $g\sim\mathcal N(0,\mathrm{Id}_d)$, and let $V, G\in\mathbb R^{d\times d}$ be the reshapings of $v$ and $T^\top g$ respectively into matrices.
Call $T_v = T(\mathrm{Id}_d, V, G)$, where we have applied $V$ and $G$ in the second and third modes of $T$, and also write $T_v$ for its reshaping into a $d\times d^2$ matrix. We have

$$T_v = \sum_{i\in[n]} a_i (V a_i \otimes G a_i)^\top.$$

We show that the matrix-vector product $Mv$ can be computed as a flattening of the following product:

$$T_v T^\top = \Big(\sum_{i\in[n]} a_i (V a_i \otimes G a_i)^\top\Big)\Big(\sum_{j\in[n]} (a_j\otimes a_j) a_j^\top\Big) = \sum_{i,j\in[n]} \langle a_j, V a_i\rangle \cdot \langle a_j, G a_i\rangle \cdot a_i a_j^\top = \sum_{i,j\in[n]} \langle a_i\otimes a_j, v\rangle \cdot \langle T^\top g, a_i\otimes a_j\rangle \cdot a_i a_j^\top.$$

Flattening $T_v T^\top$ from a $d\times d$ matrix to a vector $w\in\mathbb R^{d^2}$, we have

$$w = \sum_{i,j\in[n]} \langle T^\top g, a_i\otimes a_j\rangle \cdot \langle a_i\otimes a_j, v\rangle \cdot a_i\otimes a_j = Mv.$$

So $Mv$ is a flattening of the product $T_v T^\top$, which we will compute as a proxy for computing $Mv$ via direct multiplication. Computing $T_v = T(\mathrm{Id}, V, G)$ can be done with two matrix multiplication operations, both times multiplying a $d^2\times d$ matrix with a $d\times d$ matrix. Computing $T_v T^\top$ is a multiplication of a $d\times d^2$ matrix by a $d^2\times d$ matrix. Both these steps may be done in time $O(d^{1+\omega})$, by regarding the $d\times d^2$ matrices as block matrices with blocks of size $d\times d$. Alternatively, the asymptotically fastest known algorithm for rectangular matrix multiplication gives a time of $O(d^{3.257})$ [LG12].

Now, to compute the matrix-vector product $RMRu$ for any vector $u\in\mathbb R^{d^2}$, we may first compute $v = Ru$, perform the operation $Mv$ in time $O(d^{1+\omega})$ as described above, and then again multiply by $R$. The matrix $R$ is sparse: it has $O(d)$ entries per row (see Fact C.4), so the multiplication $Ru$ requires time $O(d^3)$. Performing the update $RMRv$ a total of $O(\log^2 n)$ times is sufficient for convergence, as we have that with reasonable probability the spectral gap satisfies $\lambda_2(RMR)/\lambda_1(RMR) \le 1 - O(\frac{1}{\log n})$, as a result of applying Theorem 5.3 with the choice of $\varepsilon = O(\frac{1}{\log n})$.
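The fast multiplication just described can be checked numerically. The following is a minimal dense sketch (the dimensions, seed, and variable names are illustrative assumptions for the demo; for clarity it uses `einsum` rather than the blocked matrix multiplications of the runtime analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 4
A = rng.normal(size=(n, d)) / np.sqrt(d)        # components a_1, ..., a_n (rows)

# 3-tensor T = sum_i a_i ⊗ a_i ⊗ a_i and its mode-1 unfolding T ∈ R^{d × d^2}
T3 = np.einsum('ip,iq,ir->pqr', A, A, A)
Tm = T3.reshape(d, d * d)

g = rng.normal(size=d)                          # Gaussian contraction vector

# Direct construction of M = sum_{ij} <g, T(a_i ⊗ a_j)> (a_i ⊗ a_j)(a_i ⊗ a_j)^T
M = np.zeros((d * d, d * d))
for i in range(n):
    for j in range(n):
        k = np.kron(A[i], A[j])
        M += (g @ (Tm @ k)) * np.outer(k, k)

def fast_matvec(v):
    """Compute Mv as a flattening of T_v T^T, never forming M explicitly."""
    V = v.reshape(d, d)
    G = (Tm.T @ g).reshape(d, d)                # reshaping of T^T g
    # T_v = T(Id, V, G): apply V and G in modes 2 and 3, then unfold to d × d^2
    Tv = np.einsum('pqr,qa,rb->pab', T3, V, G).reshape(d, d * d)
    return (Tv @ Tm.T).reshape(d * d)

v = rng.normal(size=d * d)
assert np.allclose(M @ v, fast_matvec(v))
```

The direct construction costs $O(d^4)$ space; the fast path touches only $d\times d^2$ unfoldings, which is the point of the lemma.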
Finally, checking the value of $\sum_i \langle a_i, x\rangle^3$ requires $O(d^3)$ operations, and we do so a constant number of times, once for each of the signings of the top 2 left (or right) singular vectors of $U$.

Next, we need to show that given $u$ with $\langle Ru, a_i\otimes a_i\rangle^2 \ge (1 - \tilde O(n^{3/2}/\varepsilon d^2))\cdot \|u\|^2\cdot \|a_i\|^4$, we can actually recover the tensor component $a_i$. Here Algorithm 5.17 reshapes $Ru$ to a $d\times d$ matrix and checks the top two left or right singular vectors; the next lemma shows one of these singular vectors must be highly correlated with $a_i$. (The proof is deferred to Section A.1.)

Lemma 5.19. Let $M\in\mathbb R^{d^2\times d^2}$ be a symmetric matrix with $\|M\| \le 1$, and let $v\in\mathbb R^d$ and $u\in\mathbb R^{d^2}$ be vectors. Furthermore, let $U$ be the reshaping of the vector $Mu\in\mathbb R^{d^2}$ to a matrix in $\mathbb R^{d\times d}$. Fix $c > 0$, and suppose that $\langle Mu, v\otimes v\rangle^2 \ge c^2\cdot \|u\|^2\cdot \|v\|^4$. Then $U$ has some left singular vector $a$ and some right singular vector $b$ such that $|\langle a, v\rangle|, |\langle b, v\rangle| \ge c\cdot \|v\|$. Furthermore, for any $0 < \alpha < 1$, there are $a', b'$ among the top $\lfloor \frac{1}{\alpha c^2}\rfloor$ singular vectors of $U$ with $|\langle a', v\rangle|, |\langle b', v\rangle| \ge \sqrt{1-\alpha}\cdot c\cdot \|v\|$. If $c \ge \sqrt{\frac12(1+\eta)}$ for some $\eta > 0$, then $a, b$ are amongst the top $\lfloor \frac{1+\eta}{\eta c^2}\rfloor$ singular vectors.

Since here $c^2 = 1 - o(1)$, we can choose $\eta = 1 - o(1)$ and check only the top 2 singular vectors.

Next, we must show how to choose the threshold $c(n,d)$ so that a big enough value of $\sum_{i\in[n]} \langle a_i, u_j\rangle^3$ ensures that $u_j$ is close to a tensor component. The proof is at the end of this section. (A very similar fact appears in [GM15]. We need a somewhat different parameterization here, but we reuse many of their results in the proof.)

Lemma 5.20. Let $T = \sum_{i\in[n]} a_i\otimes a_i\otimes a_i$ for normally distributed vectors $a_i \sim \mathcal N(0, \frac1d\,\mathrm{Id}_d)$. For all $0 < \gamma, \gamma' < 1$:
1. With overwhelming probability, for every $v\in\mathbb R^d$ such that $\sum_{i\in[n]} \langle a_i, v\rangle^3 \ge 1 - \gamma$, we have $\max_{i\in[n]} |\langle a_i, v\rangle| \ge 1 - O(\gamma) - \tilde O(n/d^{3/2})$.

2. With overwhelming probability over $a_1,\dots,a_n$, if $v\in\mathbb R^d$ with $\|v\| = 1$ satisfies $\langle v, a_j\rangle \ge 1 - \gamma'$ for some $j$, then $\sum_i \langle a_i, v\rangle^3 \ge 1 - O(\gamma') - \tilde O(n/d^{3/2})$.

We are now ready to prove Theorem 5.2.

Proof of Theorem 5.2. By Theorem 5.3, with probability $1 - o(1)$ over $a_1,\dots,a_n$ there are events $E_1,\dots,E_n$ with $\Pr_g(E_i) \ge \tilde\Omega(1/n^{1+O(\varepsilon)})$ such that when event $E_i$ occurs, the top eigenvector $u$ of $RMR$ satisfies

$$\frac{\langle Ru, a_i\otimes a_i\rangle^2}{\|u\|^2\cdot \|a_i\|^4} \ge 1 - \tilde O\Big(\frac{n^{3/2}}{\varepsilon d^2}\Big).$$

For a particular sample $g\sim\mathcal N(0,\mathrm{Id}_d)$, let $u_g$ be this eigenvector. The algorithm is as follows. Sample $g_1,\dots,g_r \sim \mathcal N(0,\mathrm{Id}_d)$ independently for some $r$ to be chosen later. Compute $Ru_{g_1},\dots,Ru_{g_r}$, reshape each to a $d\times d$ matrix, and compute its singular value decomposition. This gives a family of (right) singular vectors $v_1,\dots,v_{dr}$. For each, evaluate $\sum_i \langle a_i, v_j\rangle^3$. Let $c(n,d)$ be a threshold to be chosen later. Initialize $S\subseteq\mathbb R^d$ to the empty set. Examining each $1\le j\le dr$ in turn, add $v_j$ to $S$ if $\sum_i \langle a_i, v_j\rangle^3 \ge 1 - c(n,d)$ and $\langle v, v_j\rangle^2 \le 1/2$ for every $v$ already in $S$. Output the set $S$.

Choose $\varepsilon = 1/\log n$. By Lemma 5.19, when $E_i$ occurs for $g_j$, one of the signed singular vectors $\pm v_{(j-1)d+1},\dots,\pm v_{jd}$ obtained from $g_j$ is $(1 - \tilde O(n^{3/2}/d^2))$-correlated with $a_i$; call it $v$. Then by Lemma 5.20, when $E_i$ occurs for $g_j$, this $v$ will have $\sum_i \langle a_i, \pm v\rangle^3 \ge 1 - \tilde O(n/d^{3/2})$. Choose $c(n,d) = \tilde\Theta(n^{3/2}/d^2)$ so that when $E_i$ occurs for $g_j$, so long as it has not previously occurred for some $j' < j$, the algorithm adds $\pm v$ to $S$.
The events $E_i^{(t)}$ and $E_i^{(t')}$ are independent for any two executions $t$ and $t'$ of the algorithm, and have probability $\tilde\Omega(1/n)$. Thus, after $r = \tilde O(n)$ executions of the algorithm, with high probability for every $i\in[n]$ there is $j\in[r]$ so that $E_i$ occurs for $g_j$. Finally, by Lemma 5.20, the algorithm can never add to $S$ a vector which is not $(1 - \tilde O(n/d^{3/2}))$-close to some $a_i$.

It just remains to prove Lemma 5.20.

Proof of Lemma 5.20. We start with the first claim. By [GM15, Lemma 2, (proof of) Lemma 8, Theorem 4.2], the following inequalities all hold w.ov.p.:

$$\sum_{i\in[n]} \langle a_i, x\rangle^4 \le 1 + \tilde O(n/d^{3/2}) \quad \text{for all } \|x\| = 1, \tag{5.6}$$

$$\sum_{i\in[n]} \langle a_i, x\rangle^6 \ge 1 - O\Big(\Big|\sum_{i\in[n]} \langle a_i, x\rangle^3 - 1\Big|\Big) - \tilde O(n/d^{3/2}) \quad \text{for all } \|x\| = 1, \tag{5.7}$$

$$\sum_{i\in[n]} \langle a_i, x\rangle^3 \le 1 + \tilde O(n/d^{3/2}) \quad \text{for all } \|x\| = 1. \tag{5.8}$$

To begin,

$$\sum_{i\in[n]} \langle a_i, v\rangle^6 \le \Big(\max_{i\in[n]} \langle a_i, v\rangle^2\Big)\cdot \sum_{i\in[n]} \langle a_i, v\rangle^4.$$

By (5.6), this implies

$$\max_{i\in[n]} \langle a_i, v\rangle^2 \ge (1 - \tilde O(n/d^{3/2}))\cdot \sum_{i\in[n]} \langle v, a_i\rangle^6. \tag{5.9}$$

Now combining (5.7) with (5.9), we have

$$\max_{i\in[n]} \langle a_i, v\rangle^2 \ge (1 - \tilde O(n/d^{3/2}))\cdot \Big(1 - O\Big(1 - \sum_i \langle a_i, v\rangle^3\Big) - \tilde O(n/d^{3/2})\Big).$$

Together with (5.8), this concludes the proof of the first claim.

For the second claim, we note that by (5.8) and homogeneity, $|\sum_{i\neq j} \langle a_i, x\rangle^3| \le \|x\|^3 (1 + \tilde O(n/d^{3/2}))$ w.ov.p. We write $v = \langle a_j, v\rangle\, a_j + x^\perp$, where $\langle x^\perp, a_j\rangle = 0$. Now we expand

$$\sum_i \langle a_i, v\rangle^3 \ge (1-\gamma')^3 + \sum_{i\neq j} \langle \langle a_j, v\rangle\, a_j + x^\perp, a_i\rangle^3 = (1-\gamma')^3 + \sum_{i\neq j} \Big[\langle a_j, v\rangle^3 \langle a_j, a_i\rangle^3 + 3\langle a_j, v\rangle^2 \langle a_j, a_i\rangle^2 \langle x^\perp, a_i\rangle + 3\langle a_j, v\rangle \langle a_j, a_i\rangle \langle x^\perp, a_i\rangle^2 + \langle x^\perp, a_i\rangle^3\Big].$$

We estimate each term in the expansion:

$$\Big|\sum_{i\neq j} \langle a_j, v\rangle^3 \langle a_j, a_i\rangle^3\Big| \le |\langle a_j, v\rangle|^3 \sum_{i\neq j} |\langle a_j, a_i\rangle|^3 \le \tilde O\Big(\frac{n}{d^{3/2}}\Big) \quad \text{w.ov.p., by Cauchy–Schwarz and standard concentration.}$$
$$\Big|\sum_{i\neq j} \langle a_j, v\rangle^2 \langle a_j, a_i\rangle^2 \langle x^\perp, a_i\rangle\Big| \le \Big(\sum_{i\neq j} \langle a_j, v\rangle^4 \langle a_j, a_i\rangle^4\Big)^{1/2} \Big(\sum_{i\neq j} \langle x^\perp, a_i\rangle^2\Big)^{1/2} \quad \text{by Cauchy–Schwarz}$$
$$\le O(\sqrt n)\cdot \max_{i\neq j} \langle a_j, a_i\rangle^2 \cdot \tilde O\Big(\frac nd\Big)^{1/2} \le \tilde O\Big(\frac{n}{d^{3/2}}\Big) \quad \text{w.ov.p., by standard concentration;}$$

$$\Big|\sum_{i\neq j} \langle a_j, v\rangle \langle a_j, a_i\rangle \langle x^\perp, a_i\rangle^2\Big| \le O(1)\cdot \max_{i\neq j} |\langle a_j, a_i\rangle| \cdot \sum_{i\neq j} \langle x^\perp, a_i\rangle^2 \le \tilde O\Big(\frac{1}{\sqrt d}\Big)\cdot \tilde O\Big(\frac nd\Big) \le \tilde O\Big(\frac{n}{d^{3/2}}\Big) \quad \text{w.ov.p., by standard concentration;}$$

$$\Big|\sum_{i\neq j} \langle x^\perp, a_i\rangle^3\Big| \le \gamma' + \tilde O\Big(\frac{n}{d^{3/2}}\Big) \quad \text{w.ov.p., by (5.8) and homogeneity.}$$

Combining these estimates,

$$\sum_i \langle a_i, v\rangle^3 \ge (1-\gamma')^3 - \gamma' - \tilde O(n/d^{3/2}) \ge 1 - O(\gamma') - \tilde O(n/d^{3/2}),$$

since $\gamma' < 1$.

5.4.1 Boosting accuracy with local search

We remark that Algorithm 5.17 may be used in conjunction with a local search algorithm to obtain much stronger guarantees on the accuracy of the recovered vectors. Previous progress on the tensor decomposition problem has produced iterative algorithms that provide local convergence guarantees given a good enough initialization, but which leave the question of how to initialize the procedure to future work, or to the specifics of an implementation. In this context, our contribution can be seen as a general method of obtaining good initializations for these local iterative procedures.

In particular, Anandkumar et al. [AGJ15] give an algorithm that combines tensor power iteration with a form of coordinate descent which, when initialized with the output of Algorithm 5.17, achieves a linear convergence rate to the true decomposition within polynomial time.

Theorem 5.21 (Adapted from Theorem 1 in [AGJ15]). Given a rank-$n$ tensor $T = \sum_i a_i\otimes a_i\otimes a_i$ with random Gaussian components $a_i \sim \mathcal N(0, \frac1d\,\mathrm{Id}_d)$.
There is a constant $c > 0$ so that if a set of unit vectors $\{x_i \in \mathbb R^d\}_{i\in[n]}$ satisfies $\langle x_i, a_i\rangle \ge 1 - c$ for all $i\in[n]$, then there exists a procedure which, with overwhelming probability over $T$ and for any $\varepsilon > 0$, recovers a set of vectors $\{\hat a_i\}$ such that $\langle \hat a_i, a_i\rangle \ge 1 - \varepsilon$ for all $i\in[n]$, in time $O(\mathrm{poly}(d) + nd^3 \log(1/\varepsilon))$.

Remark 5.22. Theorem 1 of Anandkumar et al. is stated for random asymmetric tensors, but the adaptation to symmetric tensors is stated in equations (14) and (27) of the same paper. The theorem of Anandkumar et al. allows for a perturbation tensor $\Phi$, which is just the zero tensor in our setting. Additionally, the weight ratios specifying the weight of each rank-one component in the input tensor are $w_{\max} = w_{\min} = 1$. Lastly, the initialization conditions are given in terms of the distance $\|x_i - a_i\|$ between the initialization vectors and the true vectors, which is related to our measure of closeness $\langle x_i, a_i\rangle$ by the equation $\|x_i - a_i\|^2 = \|x_i\|^2 + \|a_i\|^2 - 2\langle x_i, a_i\rangle$. The linear convergence guarantee is stated in Lemma 12 of Anandkumar et al.

Corollary 5.23 (Corollary of Theorem 5.2). Given as input the tensor $T = \sum_{i=1}^n a_i\otimes a_i\otimes a_i$ where $a_i \sim \mathcal N(0, \frac1d\,\mathrm{Id}_d)$ with $d \le n \le d^{4/3}/\mathrm{polylog}\, d$, there is a polynomial-time algorithm which, with probability $1 - o(1)$ over the input $T$ and the algorithm randomness, finds unit vectors $\hat a_1,\dots,\hat a_n \in \mathbb R^d$ such that for all $i\in[n]$, $\langle \hat a_i, a_i\rangle \ge 1 - O(2^{-n})$.

Proof. We repeatedly invoke Algorithm 5.17 until we obtain a full set of $n$ vectors as characterized by Theorem 5.2, then apply the procedure of Theorem 5.21 to the recovered set of vectors until the desired accuracy is obtained.
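To illustrate the kind of local refinement this subsection refers to, here is a deliberately simplified sketch: plain tensor power iteration (not the full coordinate-descent procedure of [AGJ15]) started from a warm initialization. The dimensions, initialization noise level, and iteration count are arbitrary choices for this demo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 25
A = rng.normal(size=(n, d)) / np.sqrt(d)       # a_i ~ N(0, (1/d) Id), so ||a_i|| ≈ 1
T = np.einsum('ip,iq,ir->pqr', A, A, A)        # T = sum_i a_i ⊗ a_i ⊗ a_i

target = A[0] / np.linalg.norm(A[0])

# Warm start: the true component plus noise (initial correlation ≈ 0.93)
w = rng.normal(size=d)
x = target + 0.4 * w / np.linalg.norm(w)
x /= np.linalg.norm(x)
start_corr = abs(x @ target)

# Local refinement by tensor power iteration: x <- T(Id, x, x), normalized
for _ in range(25):
    x = np.einsum('pqr,q,r->p', T, x, x)
    x /= np.linalg.norm(x)

assert abs(x @ target) > start_corr            # the warm start got sharper
assert abs(x @ target) > 0.9
```

In the overcomplete regime the cross-terms $\sum_{i\neq 1}\langle a_i, x\rangle^2 a_i$ do not vanish, which is why the fixed point is only $(1-\tilde O(\cdot))$-close to the component; the procedure of [AGJ15] removes this residual error.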
6 Tensor principal component analysis

The tensor PCA problem in the spiked tensor model is similar to the setting of tensor decomposition, but here the goal is to recover a single large component, with all smaller components of the tensor regarded as random noise.

Problem 6.1 (Tensor PCA in the order-3 spiked tensor model). Given an input tensor $T = \tau\cdot v^{\otimes 3} + A$, where $v\in\mathbb R^n$ is an arbitrary unit vector, $\tau \ge 0$ is the signal-to-noise ratio, and $A$ is a random noise tensor with iid standard Gaussian entries, recover the signal $v$ approximately.

Using the partial trace method, we give the first linear-time algorithm for this problem that recovers $v$ for signal-to-noise ratio $\tau \ge n^{3/4}\,\mathrm{polylog}(n)$. In addition, the algorithm requires only $O(n^2)$ auxiliary space (compared to the input size of $n^3$) and uses only one non-adaptive pass over the input.

6.1 Spiked tensor model

The spiked tensor model (for general order-$k$ tensors) was introduced by Montanari and Richard [RM14], who also obtained the first algorithms to solve the model with provable statistical guarantees. Subsequently, the SoS approach was applied to the model to improve the signal-to-noise ratio required for odd-order tensors [HSS15]; for 3-tensors, this reduced the requirement from $\tau = \Omega(n)$ to $\tau = \Omega(n^{3/4}\log(n)^{1/4})$.

Using the linear-algebraic objects involved in the analysis of the SoS relaxation, previous work has also described algorithms with guarantees similar to those of the SoS SDP relaxation, while requiring only subquadratic or nearly linear time [HSS15]. The algorithm here improves on the previous results by use of the partial trace method, simplifying the analysis and improving the runtime by a factor of $\log n$.

6.2 Linear-time algorithm

Linear-Time Algorithm for Tensor PCA

Algorithm 6.2. Input: $T = \tau\cdot v^{\otimes 3} + A$. Goal: recover $v'$ with $\langle v, v'\rangle \ge 1 - o(1)$.
• Compute the partial trace $M := \operatorname{Tr}_{\mathbb R^n}\big(\sum_i T_i\otimes T_i\big) \in \mathbb R^{n\times n}$, where the $T_i$ are the first-mode slices of $T$.

• Output the top eigenvector $v'$ of $M$.

Theorem 6.3. When $A$ has iid standard Gaussian entries and $\tau \ge C n^{3/4}\log(n)^{1/2}/\varepsilon$ for some constant $C$, Algorithm 6.2 recovers $v'$ with $\langle v, v'\rangle \ge 1 - O(\varepsilon)$ with high probability over $A$.

Theorem 6.4. Algorithm 6.2 can be implemented in linear time and sublinear space.

These theorems are proved by routine matrix concentration results, showing that in the partial trace matrix the signal dominates the noise. To implement the algorithm in linear time, it is enough to show that this (sublinear-sized) matrix has constant spectral gap; then a standard application of the matrix power method computes the top eigenvector.

Lemma 6.5. For any unit vector $v$, with high probability over $A$, the following bounds hold:

$$\Big\|\sum_i \operatorname{Tr}(A_i)\cdot A_i\Big\| \le O(n^{3/2}\log^2 n), \qquad \Big\|\sum_i v(i)\cdot A_i\Big\| \le O(\sqrt n \log n), \qquad \Big\|\sum_i \operatorname{Tr}(A_i)\, v(i)\cdot vv^\top\Big\| \le O(\sqrt n \log n).$$

The proof may be found in Appendix D.

Proof of Theorem 6.3. We expand the partial trace $\operatorname{Tr}_{\mathbb R^n}\sum_i T_i\otimes T_i$:

$$\operatorname{Tr}_{\mathbb R^n}\sum_i T_i\otimes T_i = \sum_i \operatorname{Tr}(T_i)\cdot T_i = \sum_i \operatorname{Tr}\big(\tau\, v(i)\, vv^\top + A_i\big)\cdot \big(\tau\, v(i)\, vv^\top + A_i\big) = \sum_i \big(\tau\, v(i)\, \|v\|^2 + \operatorname{Tr}(A_i)\big)\cdot \big(\tau\, v(i)\, vv^\top + A_i\big)$$
$$= \tau^2\, vv^\top + \tau\sum_i v(i)\cdot A_i + \tau\sum_i \operatorname{Tr}(A_i)\, v(i)\cdot vv^\top + \sum_i \operatorname{Tr}(A_i)\cdot A_i.$$

Applying Lemma 6.5 and the triangle inequality, we see that

$$\Big\|\tau\sum_i v(i)\cdot A_i + \tau\sum_i \operatorname{Tr}(A_i)\, v(i)\cdot vv^\top + \sum_i \operatorname{Tr}(A_i)\cdot A_i\Big\| \le O(n^{3/2}\log^2 n)$$

with high probability. Thus, for an appropriate choice of $\tau = \Omega(n^{3/4}\sqrt{\log n}/\varepsilon)$, the matrix $\operatorname{Tr}_{\mathbb R^n}\sum_i T_i\otimes T_i$ is close to rank one, and the result follows by standard manipulations.

Proof of Theorem 6.4.
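(Before the implementation details, a minimal numerical sketch of Algorithm 6.2 in numpy; the problem size $n$, the constant in $\tau$, and the seed are illustrative assumptions rather than the theorem's parameters.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
v = rng.normal(size=n)
v /= np.linalg.norm(v)                          # planted unit vector
tau = 10 * n ** 0.75                            # signal-to-noise ratio (illustrative constant)
T = tau * np.einsum('i,j,k->ijk', v, v, v) + rng.normal(size=(n, n, n))

# Partial trace M = sum_i Tr(T_i) · T_i over the first-mode slices T_i;
# this needs only one pass over T and O(n^2) memory.
M = np.einsum('ijj,ikl->kl', T, T)
M = (M + M.T) / 2                               # symmetrize before power iteration

# Top eigenvector via the matrix power method
x = rng.normal(size=n)
for _ in range(200):
    x = M @ x
    x /= np.linalg.norm(x)

assert abs(v @ x) > 0.9                         # recovered direction correlates with v
```

At this signal level the rank-one term $\tau^2 vv^\top$ dominates the noise terms of Lemma 6.5, so the power method converges quickly, which is what the implementation argument below makes precise.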
Carrying over the expansion of the partial trace from above and setting $\tau = O(n^{3/4}\sqrt{\log n}/\varepsilon)$, the matrix $\operatorname{Tr}_{\mathbb R^n}\sum_i T_i\otimes T_i$ has a ratio of top to second-largest eigenvalue of $\Omega(1/\varepsilon)$, and so the matrix power method finds the top eigenvector in $O(\log(n/\varepsilon))$ iterations. This matrix has dimension $n\times n$, so a single iteration takes $O(n^2)$ time, which is sublinear in the input size $n^3$. Finally, to construct $\operatorname{Tr}_{\mathbb R^n}\sum_i T_i\otimes T_i$ we use

$$\operatorname{Tr}_{\mathbb R^n}\sum_i T_i\otimes T_i = \sum_i \operatorname{Tr}(T_i)\cdot T_i,$$

and note that to construct the right-hand side it is enough to examine each entry of $T$ just $O(1)$ times and perform $O(n^3)$ additions. At no point do we need to store more than $O(n^2)$ matrix entries at the same time.

Acknowledgements

We would like to thank Rong Ge for very helpful discussions. We also thank Jonah Brown Cohen, Pasin Manurangsi, and Aviad Rubinstein for helpful comments in the preparation of this manuscript.

References

[AFH+15] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu, A spectral algorithm for latent Dirichlet allocation, Algorithmica 72 (2015), no. 1, 193–214.

[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky, Tensor decompositions for learning latent variable models, Journal of Machine Learning Research 15 (2014), no. 1, 2773–2832.

[AGHK14] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade, A tensor approach to learning mixed membership community models, Journal of Machine Learning Research 15 (2014), no. 1, 2239–2312.

[AGJ14] Anima Anandkumar, Rong Ge, and Majid Janzamin, Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models, CoRR abs/1411.1488 (2014).
[AGJ15] Animashree Anandkumar, Rong Ge, and Majid Janzamin, Learning overcomplete latent variable models through tensor methods, Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3–6, 2015, pp. 36–112.

[AW02] Rudolf Ahlswede and Andreas J. Winter, Strong converse for identification via quantum channels, IEEE Transactions on Information Theory 48 (2002), no. 3, 569–579.

[BBH+12] Boaz Barak, Fernando G. S. L. Brandão, Aram Wettroth Harrow, Jonathan A. Kelner, David Steurer, and Yuan Zhou, Hypercontractivity, sum-of-squares proofs, and their applications, Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19–22, 2012, pp. 307–326.

[BCMV14] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan, Smoothed analysis of tensor decompositions, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31–June 3, 2014, pp. 594–603.

[BKS14] Boaz Barak, Jonathan A. Kelner, and David Steurer, Rounding sum-of-squares relaxations, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31–June 3, 2014, pp. 31–40.

[BKS15] Boaz Barak, Jonathan A. Kelner, and David Steurer, Dictionary learning and tensor decomposition via the sum-of-squares method, Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14–17, 2015, pp. 143–151.

[BM15] Boaz Barak and Ankur Moitra, Tensor prediction, Rademacher complexity and random 3-XOR, CoRR abs/1501.06521 (2015).

[BRS11] Boaz Barak, Prasad Raghavendra, and David Steurer, Rounding semidefinite programming hierarchies via global correlation, IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22–25, 2011, pp. 472–481.
[BS14] Boaz Barak and David Steurer, Sum-of-squares proofs and the quest toward optimal algorithms, CoRR abs/1404.5236 (2014).

[Cha96] J. T. Chang, Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency, Math Biosci. 137 (1996), 51–73.

[dGJL04] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13–18, Vancouver, British Columbia, Canada], 2004, pp. 41–48.

[DH14] Laurent Demanet and Paul Hand, Scaling law for recovering the sparsest element in a subspace, Information and Inference 3 (2014), no. 4, 295–309.

[dlPMS95] Victor H. de la Peña and Stephen J. Montgomery-Smith, Decoupling inequalities for the tail probabilities of multivariate U-statistics, The Annals of Probability (1995), 806–816.

[dlVK07] Wenceslas Fernandez de la Vega and Claire Kenyon-Mathieu, Linear programming relaxations of maxcut, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7–9, 2007, pp. 53–61.

[GHK15] Rong Ge, Qingqing Huang, and Sham M. Kakade, Learning mixtures of Gaussians in high dimensions, Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14–17, 2015, pp. 761–770.

[GM15] Rong Ge and Tengyu Ma, Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015, August 24–26, Princeton, NJ, USA, 2015, pp. 829–849.
[GS12] Venkatesan Guruswami and Ali Kemal Sinop, Faster SDP hierarchy solvers for local rounding algorithms, 53rd Annual IEEE Symposium on Foundations of Computer Science, FOCS 2012, New Brunswick, NJ, USA, October 20–23, 2012, pp. 197–206.

[GVX14] Navin Goyal, Santosh Vempala, and Ying Xiao, Fourier PCA and robust tensor decomposition, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31–June 3, 2014, pp. 584–593.

[Har70] Richard A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis, UCLA Working Papers in Phonetics 16 (1970), 1–84.

[Hås90] Johan Håstad, Tensor rank is NP-complete, J. Algorithms 11 (1990), no. 4, 644–654.

[HL13] Christopher J. Hillar and Lek-Heng Lim, Most tensor problems are NP-hard, J. ACM 60 (2013), no. 6, 45.

[HSS15] Samuel B. Hopkins, Jonathan Shi, and David Steurer, Tensor principal component analysis via sum-of-square proofs, Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3–6, 2015, pp. 956–1006.

[Jan97] Svante Janson, Gaussian Hilbert spaces, vol. 129, Cambridge University Press, 1997.

[KB09] Tamara G. Kolda and Brett W. Bader, Tensor decompositions and applications, SIAM Review 51 (2009), no. 3, 455–500.

[Las01] Jean B. Lasserre, Global optimization with polynomials and the problem of moments, SIAM J. Optim. 11 (2000/01), no. 3, 796–817. MR 1814045

[LCC07] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso, Fourth-order cumulant-based blind identification of underdetermined mixtures, IEEE Transactions on Signal Processing 55 (2007), no. 6-2, 2965–2973.

[LG12] F.
Le Gall, Faster algorithms for rectangular matrix multiplication, Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, October 2012, pp. 514–523.

[LM00] B. Laurent and P. Massart, Adaptive estimation of a quadratic functional by model selection, Ann. Statist. 28 (2000), no. 5, 1302–1338.

[LR15] Elaine Levey and Thomas Rothvoss, A Lasserre-based $(1+\varepsilon)$-approximation for $Pm \mid p_j = 1, \mathrm{prec} \mid C_{\max}$, CoRR abs/1509.07808 (2015).

[MR06] Elchanan Mossel and Sébastien Roch, Learning nonsingular phylogenies and hidden Markov models, Ann. Appl. Probab. 16 (2006), no. 2, 583–614.

[MW13] Raghu Meka and Avi Wigderson, Association schemes, non-commutative polynomial concentration, and sum-of-squares lower bounds for planted clique, CoRR abs/1307.7615 (2013).

[MW15] Tengyu Ma and Avi Wigderson, Sum-of-squares lower bounds for sparse PCA, CoRR abs/1507.06370 (2015).

[Nes00] Yurii Nesterov, Squared functional systems and optimization problems, High performance optimization, Appl. Optim., vol. 33, Kluwer Acad. Publ., Dordrecht, 2000, pp. 405–440. MR 1748764

[NR09] Phong Q. Nguyen and Oded Regev, Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures, J. Cryptology 22 (2009), no. 2, 139–160.

[Par00] Pablo A. Parrilo, Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization, Ph.D. thesis, California Institute of Technology, 2000.

[QSW14] Qing Qu, Ju Sun, and John Wright, Finding a sparse vector in a subspace: Linear sparsity using alternating directions, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13 2014, Montreal, Quebec, Canada, 2014, pp. 3401–3409.
[RM14] Emile Richard and Andrea Montanari, A statistical model for tensor PCA, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13 2014, Montreal, Quebec, Canada, 2014, pp. 2897–2905.

[RT12] Prasad Raghavendra and Ning Tan, Approximating CSPs with global cardinality constraints using SDP hierarchies, Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17–19, 2012, pp. 373–387.

[Sho87] N. Z. Shor, An approach to obtaining global extrema in polynomial problems of mathematical programming, Kibernetika (Kiev) (1987), no. 5, 102–106, 136.

[SWW12] Daniel A. Spielman, Huan Wang, and John Wright, Exact recovery of sparsely-used dictionaries, COLT 2012 - The 25th Annual Conference on Learning Theory, June 25–27, Edinburgh, Scotland, 2012, pp. 37.1–37.18.

[Tro12] Joel A. Tropp, User-friendly tail bounds for sums of random matrices, Foundations of Computational Mathematics 12 (2012), no. 4, 389–434.

[Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, CoRR abs/1011.3027 (2010).

A Additional preliminaries

A.1 Linear algebra

Here we provide some lemmas in linear algebra. The first lemma is closely related to the SoS Cauchy–Schwarz inequality from [BKS14], and the proof is essentially the same.

Lemma A.1 (PSD Cauchy–Schwarz). Let $M\in\mathbb R^{d\times d}$ be symmetric with $M\succeq 0$. Let $p_1,\dots,p_n, q_1,\dots,q_n \in \mathbb R^d$. Then

$$\Big\langle M, \sum_{i=1}^n p_i q_i^\top\Big\rangle \le \Big\langle M, \sum_{i=1}^n p_i p_i^\top\Big\rangle^{1/2} \Big\langle M, \sum_{i=1}^n q_i q_i^\top\Big\rangle^{1/2}.$$

In applications, we will have $\sum_i p_i q_i^\top$ as a single block of a larger block matrix containing also the blocks $\sum_i p_i p_i^\top$ and $\sum_i q_i q_i^\top$.

Proof.
We first claim that

$$\Big\langle M, \sum_{i=1}^n p_i q_i^\top\Big\rangle \le \frac12 \Big\langle M, \sum_{i=1}^n p_i p_i^\top\Big\rangle + \frac12 \Big\langle M, \sum_{i=1}^n q_i q_i^\top\Big\rangle.$$

To see this, just note that the right-hand side minus the left is exactly

$$\frac12 \Big\langle M, \sum_{i=1}^n (p_i - q_i)(p_i - q_i)^\top\Big\rangle = \frac12 \sum_i (p_i - q_i)^\top M (p_i - q_i) \ge 0.$$

The lemma now follows by applying this inequality to

$$p'_i = \frac{p_i}{\langle M, \sum_{i=1}^n p_i p_i^\top\rangle^{1/2}}, \qquad q'_i = \frac{q_i}{\langle M, \sum_{i=1}^n q_i q_i^\top\rangle^{1/2}}.$$

Lemma A.2 (Operator norm Cauchy–Schwarz for sums). Let $A_1,\dots,A_m, B_1,\dots,B_m$ be real random matrices. Then

$$\Big\|\sum_i \mathbb E\, A_i B_i\Big\| \le \Big\|\sum_i \mathbb E\, A_i^\top A_i\Big\|^{1/2} \Big\|\sum_i \mathbb E\, B_i^\top B_i\Big\|^{1/2}.$$

Proof. We have for any unit $x, y$,

$$x^\top \Big(\sum_i \mathbb E\, A_i B_i\Big) y = \sum_i \mathbb E\, \langle A_i x, B_i y\rangle \le \sum_i \mathbb E\, \|A_i x\| \|B_i y\| \le \sum_i (\mathbb E\, \|A_i x\|^2)^{1/2} (\mathbb E\, \|B_i y\|^2)^{1/2}$$
$$\le \sqrt{\sum_i \mathbb E\, \|A_i x\|^2}\, \sqrt{\sum_i \mathbb E\, \|B_i y\|^2} = \sqrt{\mathbb E\, x^\top \sum_i A_i^\top A_i\, x}\, \sqrt{\mathbb E\, y^\top \sum_i B_i^\top B_i\, y} \le \Big\|\sum_i \mathbb E\, A_i^\top A_i\Big\|^{1/2} \Big\|\sum_i \mathbb E\, B_i^\top B_i\Big\|^{1/2},$$

where the nontrivial inequalities follow from Cauchy–Schwarz for expectations, vectors, and scalars, respectively.

The following lemma allows us to argue about the top eigenvector of matrices with spectral gap.

Lemma A.3 (Top eigenvector of gapped matrices). Let $M$ be a symmetric $r\times r$ matrix, and let $u, v$ be vectors in $\mathbb R^r$ with $\|u\| = 1$. Suppose $u$ is a top singular vector of $M$, so that $|\langle u, Mu\rangle| = \|M\|$, and that $v$ satisfies, for some $\varepsilon > 0$,

$$\|M - vv^\top\| \le \|M\| - \varepsilon\cdot\|v\|^2.$$

Then $\langle u, v\rangle^2 \ge \varepsilon\cdot\|v\|^2$.

Proof. We lower bound the quadratic form of $M - vv^\top$ evaluated at $u$ by

$$|\langle u, (M - vv^\top)\, u\rangle| \ge |\langle u, Mu\rangle| - \langle u, v\rangle^2 = \|M\| - \langle u, v\rangle^2.$$

At the same time, this quadratic form evaluated at $u$ is upper bounded by $\|M\| - \varepsilon\cdot\|v\|^2$. It follows that $\langle u, v\rangle^2 \ge \varepsilon\cdot\|v\|^2$, as desired.

The following lemma states that if a vector in $\mathbb R^{d^2}$ which is close to a symmetric vector $v^{\otimes 2}$ is flattened to a matrix, the resulting matrix has top singular vectors correlated with $v$.

Lemma (Restatement of Lemma 5.19).
Let $M\in\mathbb R^{d^2\times d^2}$ be a symmetric matrix with $\|M\| \le 1$, and let $v\in\mathbb R^d$ and $u\in\mathbb R^{d^2}$ be vectors. Furthermore, let $U$ be the reshaping of the vector $Mu\in\mathbb R^{d^2}$ to a matrix in $\mathbb R^{d\times d}$. Fix $c > 0$, and suppose that $\langle Mu, v\otimes v\rangle^2 \ge c^2\cdot\|u\|^2\cdot\|v\|^4$. Then $U$ has some left singular vector $a$ and some right singular vector $b$ such that $|\langle a, v\rangle|, |\langle b, v\rangle| \ge c\cdot\|v\|$. Furthermore, for any $0 < \alpha < 1$, there are $a', b'$ among the top $\lfloor \frac{1}{\alpha c^2}\rfloor$ singular vectors of $U$ with $|\langle a', v\rangle|, |\langle b', v\rangle| \ge \sqrt{1-\alpha}\cdot c\cdot\|v\|$. If $c \ge \sqrt{\frac12(1+\eta)}$ for some $\eta > 0$, then $a, b$ are amongst the top $\lfloor \frac{1+\eta}{\eta c^2}\rfloor$ singular vectors.

Proof. Let $\hat v = v/\|v\|$. Let $(\sigma_i, a_i, b_i)$ be the $i$th singular value and left and right (unit) singular vectors of $U$, respectively. Our assumptions imply that $|\hat v^\top U \hat v| = |\langle Mu, \hat v\otimes\hat v\rangle| \ge c\cdot\|u\|$. Furthermore, we observe that $\|U\|_F = \|Mu\| \le \|M\|\cdot\|u\|$, and therefore $\|U\|_F \le \|u\|$. We thus have

$$c\cdot\|u\| \le |\hat v^\top U \hat v| = \Big|\sum_{i\in[d]} \sigma_i\cdot\langle \hat v, a_i\rangle\langle \hat v, b_i\rangle\Big| \le \|u\|\cdot\sqrt{\sum_{i\in[d]} \langle \hat v, a_i\rangle^2 \langle \hat v, b_i\rangle^2},$$

where to obtain the last inequality we have used Cauchy–Schwarz and our bound on $\|U\|_F$. We may thus conclude that

$$c^2 \le \sum_{i\in[d]} \langle \hat v, a_i\rangle^2 \langle \hat v, b_i\rangle^2 \le \max_{i\in[d]} \langle a_i, \hat v\rangle^2 \cdot \sum_{i\in[d]} \langle b_i, \hat v\rangle^2 = \max_{i\in[d]} \langle a_i, \hat v\rangle^2, \tag{A.1}$$

where we have used the fact that the right singular vectors $b_i$ of $U$ are orthonormal. The argument is symmetric in the $b_i$. Furthermore, we have

$$c^2\cdot\|u\|^2 \le (\hat v^\top U \hat v)^2 = \Big(\sum_{i\in[d]} \sigma_i\cdot\langle \hat v, a_i\rangle\langle \hat v, b_i\rangle\Big)^2 \le \sum_{i\in[d]} \sigma_i^2 \langle \hat v, a_i\rangle^2 \cdot \sum_{i\in[d]} \langle \hat v, b_i\rangle^2 = \sum_{i\in[d]} \sigma_i^2 \langle \hat v, a_i\rangle^2,$$

where we have applied Cauchy–Schwarz and the orthonormality of the $b_i$. In particular,

$$\sum_{i\in[d]} \sigma_i^2 \langle \hat v, a_i\rangle^2 \ge c^2 \|u\|^2 \ge c^2 \|U\|_F^2.$$

On the other hand, let $S$ be the set of $i\in[d]$ for which $\sigma_i^2 \le \alpha c^2 \|U\|_F^2$.
By substitution,
\[ \sum_{i \in S} \sigma_i^2 \langle \hat v, a_i \rangle^2 \le \alpha c^2 \|U\|_F^2 \sum_{i \in S} \langle \hat v, a_i \rangle^2 \le \alpha c^2 \|U\|_F^2 \,, \]
where we have used the fact that the left singular vectors $a_i$ are orthonormal. The last two inequalities imply that $S \ne [d]$. Letting $T = [d] \setminus S$, it follows by subtraction that
\[ (1-\alpha) c^2 \|U\|_F^2 \le \sum_{i \in T} \sigma_i^2 \langle \hat v, a_i \rangle^2 \le \max_{i \in T} \langle \hat v, a_i \rangle^2 \sum_{i \in T} \sigma_i^2 \le \max_{i \in T} \langle \hat v, a_i \rangle^2 \cdot \|U\|_F^2 \,, \]
so that $\max_{i \in T} \langle \hat v, a_i \rangle^2 \ge (1-\alpha) c^2$. Finally,
\[ |T| \cdot \alpha c^2 \|U\|_F^2 \le |T| \cdot \min_{i \in T} \sigma_i^2 \le \sum_{i \in [d]} \sigma_i^2 = \|U\|_F^2 \,, \]
so that $|T| \le \lfloor \frac{1}{\alpha c^2} \rfloor$. Thus, one of the top $\lfloor \frac{1}{\alpha c^2} \rfloor$ left singular vectors $a$ has correlation $|\langle \hat v, a \rangle| \ge \sqrt{1-\alpha}\, c$. The same proof holds for the $b_i$. Furthermore, if $c^2 \ge \frac12(1+\eta)$ for some $\eta > 0$ and $(1-\alpha) c^2 > \frac12$, then by (A.1) it must be that $\max_{i \in T} \langle \hat v, a_i \rangle^2 = \max_{i \in [d]} \langle \hat v, a_i \rangle^2$, as $\hat v$ cannot have squared correlation larger than $\frac12$ with more than one of the orthonormal vectors $a_i$. Taking $\alpha = \frac{\eta}{1+\eta}$ guarantees this. The conclusion follows.

A.2 Concentration tools

We require a number of tools from the literature on concentration of measure.

A.2.1 For scalar-valued polynomials of Gaussians

We need some concentration bounds for certain polynomials of Gaussian random variables. The following lemma gives standard bounds on the tails of a standard Gaussian variable—somewhat more precisely than the other bounds in this paper. Though there are ample sources, we repeat the proof here for reference.

Lemma A.4. Let $X \sim \mathcal{N}(0,1)$. Then for $t > 0$,
\[ \mathbb{P}(X \ge t) \le \frac{e^{-t^2/2}}{t\sqrt{2\pi}} \,, \qquad\text{and}\qquad \mathbb{P}(X \ge t) \ge \frac{e^{-t^2/2}}{\sqrt{2\pi}} \Big( \frac1t - \frac1{t^3} \Big) \,. \]

Proof. To show the first statement, we apply an integration trick:
\[ \mathbb{P}(X \ge t) = \frac{1}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\,dx \le \frac{1}{\sqrt{2\pi}} \int_t^\infty \frac{x}{t}\, e^{-x^2/2}\,dx = \frac{e^{-t^2/2}}{t\sqrt{2\pi}} \,, \]
where the inequality uses the fact that $x/t \ge 1$ for $x \ge t$.
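As a quick numerical sanity check of Lemma A.4 (our addition, not part of the original text), the script below compares both stated bounds against the exact Gaussian tail $\mathbb{P}(X \ge t) = \frac12\mathrm{erfc}(t/\sqrt2)$; the test points are arbitrary.

```python
import math

# Upper and lower Gaussian tail bounds from Lemma A.4.
def upper(t):
    return math.exp(-t * t / 2) / (t * math.sqrt(2 * math.pi))

def lower(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi) * (1 / t - 1 / t**3)

for t in [1.5, 2.0, 3.0, 5.0]:
    exact = 0.5 * math.erfc(t / math.sqrt(2))  # P(X >= t) for X ~ N(0,1)
    assert lower(t) <= exact <= upper(t)
```

Both bounds sandwich the true tail at every test point, and they visibly tighten as $t$ grows, matching the $1/t - 1/t^3$ versus $1/t$ prefactors.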
For the second statement, we integrate by parts and repeat the trick:
\[ \mathbb{P}(X \ge t) = \frac{1}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int_t^\infty \frac1x \cdot x e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \Big[ -\frac1x e^{-x^2/2} \Big]_t^\infty - \frac{1}{\sqrt{2\pi}} \int_t^\infty \frac{1}{x^2}\, e^{-x^2/2}\,dx \]
\[ \ge \frac{1}{\sqrt{2\pi}} \cdot \frac1t e^{-t^2/2} - \frac{1}{\sqrt{2\pi}} \int_t^\infty \frac{x}{t^3}\, e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \Big( \frac1t - \frac1{t^3} \Big) e^{-t^2/2} \,. \]
This concludes the proof.

The following is a small modification of Theorem 6.7 from [Jan97], which follows from Remark 6.8 in the same.

Lemma A.5. For each $\ell \ge 1$ there is a universal constant $c_\ell > 0$ such that for every degree-$\ell$ polynomial $f$ of standard Gaussian random variables $X_1, \dots, X_m$ and every $t \ge 2$,
\[ \mathbb{P}\big( |f(X)| \ge t\, \mathbb{E}|f(X)| \big) \le e^{-c_\ell t^{2/\ell}} \,. \]
The same holds (with a different constant $c_\ell$) if $\mathbb{E}|f(X)|$ is replaced by $(\mathbb{E} f(X)^2)^{1/2}$.

In our concentration results, we will need to calculate the expectations of multivariate Gaussian polynomials, many of which share a common form. Below we give an expression for these expectations.

Fact A.6. Let $x$ be a $d$-dimensional vector with independent identically distributed Gaussian entries of variance $\sigma^2$, and let $u$ be a fixed unit vector. Setting $X = (\|x\|^2 - c)^p \|x\|^{2m}\, xx^\top$ and $U = (\|x\|^2 - c)^p \|x\|^{2m}\, uu^\top$, we have
\[ \mathbb{E}[X] = \sum_{0 \le k \le p} \binom{p}{k} (-1)^k c^k\, (d+2)\cdots(d+2p+2m-2k)\, \sigma^{2(p+m-k+1)} \cdot \mathrm{Id} \,, \]
and
\[ \mathbb{E}[U] = \sum_{0 \le k \le p} \binom{p}{k} (-1)^k c^k\, d(d+2)\cdots(d+2p+2m-2k-2)\, \sigma^{2(p+m-k)} \cdot uu^\top \,. \]

Proof. By symmetry,
\[ \mathbb{E}[X] = \mathbb{E}\big[ (\|x\|^2 - c)^p \|x\|^{2m} x_1^2 \big] \cdot \mathrm{Id} = \mathrm{Id} \cdot \sum_{0 \le k \le p} \binom{p}{k} (-1)^k c^k\, \mathbb{E}\Big( \sum_{\ell \in [d]} x_\ell^2 \Big)^{p+m-k} x_1^2 \,. \]
Since $\big( \sum_{i \in [d]} x_i^2 \big)^{p+m-k}$ is symmetric in $x_1, \dots, x_d$, this equals
\[ \mathrm{Id} \cdot \frac1d \sum_{0 \le k \le p} \binom{p}{k} (-1)^k c^k\, \mathbb{E}\Big( \sum_{i \in [d]} x_i^2 \Big)^{p+m-k+1} \,. \]
We have reduced the computation to a question about the moments of a chi-squared variable with $d$ degrees of freedom. Using these moments,
\[ \mathbb{E}[X] = \mathrm{Id} \cdot \frac1d \sum_{0 \le k \le p} \binom{p}{k} (-1)^k c^k\, d(d+2)\cdots(d+2p+2m-2k)\, \sigma^{2(p+m-k+1)} = \mathrm{Id} \cdot \sum_{0 \le k \le p} \binom{p}{k} (-1)^k c^k\, (d+2)\cdots(d+2p+2m-2k)\, \sigma^{2(p+m-k+1)} \,. \]
A similar computation yields the result for $\mathbb{E}[U]$.

A.2.2 For matrix-valued random variables

On several occasions we will need to apply a Matrix-Bernstein-like theorem to a sum of matrices with an unfortunate tail. To this end, we prove a "truncated Matrix Bernstein inequality." Our proof uses a standard matrix Bernstein inequality as a black box. The study of inequalities of this variety—on tails of sums of independent matrix-valued random variables—was initiated by Ahlswede and Winter [AW02]. The excellent survey of Tropp [Tro12] provides many results of this kind. In applications of the following, the operator norms of the summands $X_1, \dots, X_n$ have well-behaved tails, and so the truncation is a routine formality. Two corollaries following the proposition and its proof capture truncation for all the matrices we encounter in the present work.

Proposition A.7 (Truncated Matrix Bernstein). Let $X_1, \dots, X_n \in \mathbb{R}^{d_1 \times d_2}$ be independent random matrices, and suppose that $\mathbb{P}\big[ \|X_i - \mathbb{E}[X_i]\|_{op} \ge \beta \big] \le p$ for all $i \in [n]$. Furthermore, suppose that for each $X_i$,
\[ \big\| \mathbb{E}[X_i] - \mathbb{E}\big[ X_i\, \mathbb{I}\{ \|X_i\|_{op} < \beta \} \big] \big\| \le q \,. \]
Denote
\[ \sigma^2 = \max\bigg\{ \Big\| \sum_{i \in [n]} \mathbb{E}\big[ X_i X_i^\top \big] - \mathbb{E}[X_i]\, \mathbb{E}\big[ X_i^\top \big] \Big\|_{op} \,,\; \Big\| \sum_{i \in [n]} \mathbb{E}\big[ X_i^\top X_i \big] - \mathbb{E}[X_i]^\top \mathbb{E}[X_i] \Big\|_{op} \bigg\} \,. \]
Then for $X = \sum_{i \in [n]} X_i$, we have
\[ \mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \big] \le n \cdot p + (d_1 + d_2) \cdot \exp\bigg( \frac{-(t - nq)^2}{2(\sigma^2 + \beta(t - nq)/3)} \bigg) \,. \]

Proof. For simplicity we start by centering the variables $X_i$. Let $\tilde X_i = X_i - \mathbb{E} X_i$ and $\tilde X = \sum_{i \in [n]} \tilde X_i$. The proof proceeds by a straightforward application of the noncommutative Bernstein inequality. We define variables $Y_1, \dots, Y_n$, which are the truncated counterparts of the $\tilde X_i$ in the following sense:
\[ Y_i = \begin{cases} \tilde X_i & \|\tilde X_i\|_{op} < \beta \,, \\ 0 & \text{otherwise} \,. \end{cases} \]
Define $Y = \sum_{i \in [n]} Y_i$. We claim that
\[ \Big\| \sum_i \mathbb{E}\, Y_i Y_i^\top - \mathbb{E}[Y_i]\, \mathbb{E}[Y_i]^\top \Big\|_{op} \le \Big\| \sum_i \mathbb{E}\, \tilde X_i \tilde X_i^\top \Big\|_{op} \le \sigma^2 \quad\text{and} \tag{A.2} \]
\[ \Big\| \sum_i \mathbb{E}\, Y_i^\top Y_i - \mathbb{E}[Y_i]^\top \mathbb{E}[Y_i] \Big\|_{op} \le \Big\| \sum_i \mathbb{E}\, \tilde X_i^\top \tilde X_i \Big\|_{op} \le \sigma^2 \,, \tag{A.3} \]
which, together with the fact that $\|Y_i\| \le \beta$ almost surely, will allow us to apply the noncommutative Bernstein inequality to $Y$. To see (A.2) ((A.3) is similar), we expand $\mathbb{E}\, Y_i Y_i^\top$ as
\[ \mathbb{E}\, Y_i Y_i^\top = \mathbb{P}\big[ \|\tilde X_i\|_{op} < \beta \big] \cdot \mathbb{E}\big[ \tilde X_i \tilde X_i^\top \,\big|\, \|\tilde X_i\|_{op} < \beta \big] \,. \]
Additionally expanding $\mathbb{E}[\tilde X_i \tilde X_i^\top]$ as
\[ \mathbb{E}\big[ \tilde X_i \tilde X_i^\top \big] = \mathbb{P}\big[ \|\tilde X_i\|_{op} < \beta \big]\, \mathbb{E}\big[ \tilde X_i \tilde X_i^\top \,\big|\, \|\tilde X_i\|_{op} < \beta \big] + \mathbb{P}\big[ \|\tilde X_i\|_{op} \ge \beta \big]\, \mathbb{E}\big[ \tilde X_i \tilde X_i^\top \,\big|\, \|\tilde X_i\|_{op} \ge \beta \big] \,, \]
we note that $\mathbb{E}[\tilde X_i \tilde X_i^\top \mid \|\tilde X_i\|_{op} \ge \beta]$ is PSD. Thus $\mathbb{E}[Y_i Y_i^\top] \preceq \mathbb{E}[\tilde X_i \tilde X_i^\top]$. But by definition $\mathbb{E}[Y_i Y_i^\top]$ is still PSD (and hence $\| \sum_i \mathbb{E}[Y_i Y_i^\top] \|_{op}$ is given by the maximum eigenvalue of the sum), so
\[ \Big\| \sum_i \mathbb{E}\, Y_i Y_i^\top \Big\|_{op} \le \Big\| \sum_i \mathbb{E}\, \tilde X_i \tilde X_i^\top \Big\|_{op} \,. \]
Also PSD are $\mathbb{E}[Y_i]\, \mathbb{E}[Y_i]^\top$ and $\mathbb{E}\big[ (Y_i - \mathbb{E}[Y_i])(Y_i - \mathbb{E}[Y_i])^\top \big] = \mathbb{E}[Y_i Y_i^\top] - \mathbb{E}[Y_i]\, \mathbb{E}[Y_i]^\top$. By the same reasoning again, we get $\big\| \sum_i \mathbb{E}\, Y_i Y_i^\top - \mathbb{E}[Y_i]\, \mathbb{E}[Y_i]^\top \big\|_{op} \le \big\| \sum_i \mathbb{E}[Y_i Y_i^\top] \big\|_{op}$. Putting this all together gives (A.2).

Now we are ready to apply the noncommutative Bernstein inequality to $Y$. We have
\[ \mathbb{P}\big[ \|Y - \mathbb{E}[Y]\|_{op} \ge \alpha \big] \le (d_1 + d_2) \cdot \exp\bigg( \frac{-\alpha^2/2}{\sigma^2 + \beta \alpha/3} \bigg) \,. \]
Now, we have
\[ \mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \big] = \mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \,\big|\, X = Y \big] \cdot \mathbb{P}[X = Y] + \mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \,\big|\, X \ne Y \big] \cdot \mathbb{P}[X \ne Y] \]
\[ \le \mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \,\big|\, X = Y \big] + n \cdot p \]
by a union bound over the events $\{ X_i \ne Y_i \}$. It remains to bound the conditional probability $\mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \mid X = Y \big]$.
By assumption, $\| \mathbb{E}[X] - \mathbb{E}[Y] \|_{op} \le nq$, and so by the triangle inequality,
\[ \|X - \mathbb{E}[X]\|_{op} \le \|X - \mathbb{E}[Y]\|_{op} + \|\mathbb{E}[Y] - \mathbb{E}[X]\|_{op} \le \|X - \mathbb{E}[Y]\|_{op} + nq \,. \]
Thus,
\[ \mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \mid X = Y \big] \le \mathbb{P}\big[ \|X - \mathbb{E}[Y]\|_{op} + nq \ge t \mid X = Y \big] = \mathbb{P}\big[ \|Y - \mathbb{E}[Y]\|_{op} \ge t - nq \mid X = Y \big] \,. \]
Putting everything together and setting $\alpha = t - nq$,
\[ \mathbb{P}\big[ \|X - \mathbb{E}[X]\|_{op} \ge t \big] \le n \cdot p + (d_1 + d_2) \cdot \exp\bigg( \frac{-(t - nq)^2/2}{\sigma^2 + \beta(t - nq)/3} \bigg) \,, \]
as desired.

The following lemma helps achieve the assumptions of Proposition A.7 easily for a useful class of thin-tailed random matrices.

Lemma A.8. Suppose that $X$ is a matrix whose entries are polynomials of constant degree $\ell$ in unknowns $x$, which we evaluate at independent Gaussians. Let $f(x) := \|X\|_{op}$ and $g(x) := \|XX^\top\|_{op}$, and suppose that either $f$ is itself a polynomial in $x$ of degree at most $2\ell$ or $g$ is a polynomial in $x$ of degree at most $4\ell$. Then if $\beta = R \cdot \alpha$ for $\alpha \ge \min\{ \mathbb{E}|f(x)|, \sqrt{\mathbb{E}\, g(x)} \}$ and $R = \mathrm{polylog}(n)$,
\[ \mathbb{P}\big( \|X\|_{op} \ge \beta \big) \le n^{-\log n} \,, \tag{A.4} \]
and
\[ \mathbb{E}\big[ \big\| X \cdot \mathbb{I}\{ \|X\|_{op} \ge \beta \} \big\|_{op} \big] \le (\beta + \alpha)\, n^{-\log n} \,. \tag{A.5} \]

Proof. We begin with (A.4). Either $f(x)$ is a polynomial of degree at most $2\ell$, or $g(x)$ is a polynomial of degree at most $4\ell$, in Gaussian variables. We can thus use Lemma A.5 to obtain the bound
\[ \mathbb{P}\big( |f(x)| \ge t\alpha \big) \le \exp\big( -c t^{1/(2\ell)} \big) \,, \tag{A.6} \]
where $c$ is a universal constant. Taking $t = R = \mathrm{polylog}(n)$ gives us (A.4).

We now address (A.5). To this end, let $p(t)$ and $P(t)$ be the probability density function and cumulative distribution function of $\|X\|_{op}$, respectively.
We apply Jensen's inequality and instead bound
\[ \big\| \mathbb{E}\big[ X\, \mathbb{I}\{ \|X\|_{op} \ge \beta \} \big] \big\| \le \mathbb{E}\big[ \|X\|_{op}\, \mathbb{I}\{ \|X\|_{op} \ge \beta \} \big] = \int_0^\infty t \cdot \mathbb{I}\{ t \ge \beta \}\, p(t)\,dt = \int_\beta^\infty t\, p(t)\,dt \,, \]
since the indicator is 0 for $t \le \beta$. Integrating by parts,
\[ \int_\beta^\infty t\, p(t)\,dt = \big[ -t\,(1 - P(t)) \big]_\beta^\infty + \int_\beta^\infty (1 - P(t))\,dt \,, \]
and using the equality of $1 - P(t)$ with $\mathbb{P}(\|X\|_{op} > t)$ along with (A.4),
\[ \le \beta\, n^{-\log n} + \int_\beta^\infty \mathbb{P}\big( \|X\|_{op} > t \big)\,dt \,. \]
Applying the change of variables $t = \alpha s$ so as to apply (A.6),
\[ = \beta\, n^{-\log n} + \alpha \int_R^\infty \mathbb{P}\big( \|X\|_{op} > \alpha s \big)\,ds \le \beta\, n^{-\log n} + \alpha \int_R^\infty \exp\big( -c s^{1/(2\ell)} \big)\,ds \,. \]
Now applying a change of variables so that $s = \big( \frac{u \log n}{c} \big)^{2\ell}$,
\[ = \beta\, n^{-\log n} + \alpha \int_{cR^{1/(2\ell)}/\log n}^\infty n^{-u} \cdot 2\ell \Big( \frac{\log n}{c} \Big)^{2\ell} u^{2\ell - 1}\,du \le \beta\, n^{-\log n} + \alpha \int_{cR^{1/(2\ell)}/\log n}^\infty n^{-u/2}\,du \,, \]
where we have used the assumption that $\ell$ is constant. We can approximate this by a geometric sum,
\[ \le \beta\, n^{-\log n} + \alpha \sum_{u = cR^{1/(2\ell)}/\log n}^\infty n^{-u/2} \le \beta\, n^{-\log n} + \alpha \cdot n^{-cR^{1/(2\ell)}/(2\log n)} \,. \]
Evaluating at $R = \mathrm{polylog}(n)$ for a sufficiently large polynomial in the logarithm gives us
\[ \mathbb{E}\big[ \big\| X \cdot \mathbb{I}\{ \|X\|_{op} \ge \beta \} \big\|_{op} \big] \le (\beta + \alpha)\, n^{-\log n} \,, \]
as desired.

B Concentration bounds for planted sparse vector in random linear subspace

Proof of Lemma 4.7. Let $c := \sum_{i=1}^n v(i)\, b_i$. The matrix in question has a nice block structure:
\[ \sum_{i=1}^n a_i a_i^\top = \begin{pmatrix} \|v\|_2^2 & c^\top \\ c & \sum_{i=1}^n b_i b_i^\top \end{pmatrix} \,. \]
The vector $c$ is distributed as $\mathcal{N}(0, \frac1n \mathrm{Id}_{d-1})$, so by standard concentration $\|c\| \le \tilde O(d/n)^{1/2}$ w.ov.p. By assumption, $\|v\|_2^2 = 1$. Thus by the triangle inequality, w.ov.p.,
\[ \Big\| \sum_{i=1}^n a_i a_i^\top - \mathrm{Id}_d \Big\| \le \tilde O\Big( \frac dn \Big)^{1/2} + \Big\| \sum_{i=1}^n b_i b_i^\top - \mathrm{Id}_{d-1} \Big\| \,. \]
By [Ver10, Corollary 5.50] applied to the subgaussian vectors $\sqrt n \cdot b_i$, w.ov.p.,
\[ \Big\| \sum_{i=1}^n b_i b_i^\top - \mathrm{Id}_{d-1} \Big\| \le O\Big( \frac dn \Big)^{1/2} \,, \]
and hence $\| \sum_{i=1}^n a_i a_i^\top - \mathrm{Id}_d \| \le \tilde O(d/n)^{1/2}$ w.ov.p.
This implies $\| (\sum_{i=1}^n a_i a_i^\top)^{-1} - \mathrm{Id}_d \| \le \tilde O(d/n)^{1/2}$ and $\| (\sum_{i=1}^n a_i a_i^\top)^{-1/2} - \mathrm{Id}_d \| \le \tilde O(d/n)^{1/2}$ when $d = o(n)$, by the following facts applied to the eigenvalues of $\sum_{i=1}^n a_i a_i^\top$. For $0 \le \varepsilon < 1$,
\[ (1+\varepsilon)^{-1} = 1 - O(\varepsilon) \quad\text{and}\quad (1-\varepsilon)^{-1} = 1 + O(\varepsilon) \,, \]
\[ (1+\varepsilon)^{-1/2} = 1 - O(\varepsilon) \quad\text{and}\quad (1-\varepsilon)^{-1/2} = 1 + O(\varepsilon) \,. \]
These are proved easily via the identity $(1+\varepsilon)^{-1} = \sum_{k=0}^\infty (-\varepsilon)^k$ and similar.

Orthogonal subspace basis

Lemma B.1. Let $a_1, \dots, a_n \in \mathbb{R}^d$ be independent random vectors from $\mathcal{N}(0, \frac1n \mathrm{Id})$ with $d \le n$, and let $A = \sum_{i=1}^n a_i a_i^\top$. Then for every unit vector $x \in \mathbb{R}^d$, with overwhelming probability $1 - d^{-\omega(1)}$,
\[ \big| \langle x, A^{-1} x \rangle - \|x\|^2 \big| \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \cdot \|x\|^2 \,. \]

Proof. Let $x \in \mathbb{R}^d$. By scale invariance, we may assume $\|x\| = 1$. By standard matrix concentration bounds, the matrix $B = \mathrm{Id} - A$ has spectral norm $\|B\| \le \tilde O(d/n)^{1/2}$ w.ov.p. [Ver10, Corollary 5.50]. Since $A^{-1} = (\mathrm{Id} - B)^{-1} = \sum_{k=0}^\infty B^k$, the spectral norm of $A^{-1} - \mathrm{Id} - B$ is at most $\sum_{k=2}^\infty \|B\|^k$ (whenever the series converges). Hence, $\| A^{-1} - \mathrm{Id} - B \| \le \tilde O(d/n)$ w.ov.p. It follows that it is enough to show that $|\langle x, Bx \rangle| \le \tilde O(1/n)^{1/2}$ w.ov.p. The random variable $n - n \langle x, Bx \rangle = \sum_{i=1}^n \langle \sqrt n \cdot a_i, x \rangle^2$ is $\chi^2$-distributed with $n$ degrees of freedom. Thus, by standard concentration bounds, $n |\langle x, Bx \rangle| \le \tilde O(\sqrt n)$ w.ov.p. [LM00]. We conclude that with overwhelming probability $1 - d^{-\omega(1)}$,
\[ \big| \langle x, A^{-1} x \rangle - \|x\|^2 \big| \le |\langle x, Bx \rangle| + \tilde O(d/n) \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \,. \]

Lemma B.2. Let $a_1, \dots, a_n \in \mathbb{R}^d$ be independent random vectors from $\mathcal{N}(0, \frac1n \mathrm{Id})$ with $d \le n$, and let $A = \sum_{i=1}^n a_i a_i^\top$. Then for every index $j \in [n]$, with overwhelming probability $1 - d^{-\omega(1)}$,
\[ \big| \langle a_j, A^{-1} a_j \rangle - \|a_j\|^2 \big| \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \cdot \|a_j\|^2 \,. \]

Proof. Let $A_{-j} = \sum_{i \ne j} a_i a_i^\top$.
By Sherman–Morrison,
\[ A^{-1} = \big( A_{-j} + a_j a_j^\top \big)^{-1} = A_{-j}^{-1} - \frac{1}{1 + a_j^\top A_{-j}^{-1} a_j}\, A_{-j}^{-1} a_j a_j^\top A_{-j}^{-1} \,. \]
Thus, $\langle a_j, A^{-1} a_j \rangle = \langle a_j, A_{-j}^{-1} a_j \rangle - \langle a_j, A_{-j}^{-1} a_j \rangle^2 / (1 + \langle a_j, A_{-j}^{-1} a_j \rangle)$. Since $\| \frac{n}{n-1} A_{-j} - \mathrm{Id} \| = \tilde O(d/n)^{1/2}$ w.ov.p., we also have $\| A_{-j}^{-1} \| \le 2$ with overwhelming probability. Therefore, w.ov.p.,
\[ \big| \langle a_j, A^{-1} a_j \rangle - \langle a_j, A_{-j}^{-1} a_j \rangle \big| \le \langle a_j, A_{-j}^{-1} a_j \rangle^2 \le 4 \|a_j\|^4 \le \tilde O(d/n) \cdot \|a_j\|^2 \,. \]
At the same time, by Lemma B.1, w.ov.p.,
\[ \Big| \big\langle a_j, \tfrac{n}{n-1} A_{-j}^{-1} a_j \big\rangle - \|a_j\|^2 \Big| \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \cdot \|a_j\|^2 \,. \]
We conclude that, w.ov.p.,
\[ \big| \langle a_j, A^{-1} a_j \rangle - \|a_j\|^2 \big| \le \big| \langle a_j, A^{-1} a_j \rangle - \langle a_j, A_{-j}^{-1} a_j \rangle \big| + \Big| \langle a_j, A_{-j}^{-1} a_j \rangle - \tfrac{n-1}{n} \|a_j\|^2 \Big| + \tfrac1n \|a_j\|^2 \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \cdot \|a_j\|^2 \,. \]

Lemma B.3. Let $A$ be a block matrix where one of the diagonal blocks is the $1 \times 1$ identity; that is,
\[ A = \begin{pmatrix} \|v\|^2 & c^\top \\ c & B \end{pmatrix} = \begin{pmatrix} 1 & c^\top \\ c & B \end{pmatrix} \]
for some matrix $B$ and vector $c$. Let $x$ be a vector which decomposes as $x = (x(1), x')$, where $x(1) = \langle x, e_1 \rangle$ for $e_1$ the first standard basis vector. Then
\[ \langle x, A^{-1} x \rangle = \Big\langle x', \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) x' \Big\rangle - 2 x(1) \Big\langle \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) c, x' \Big\rangle + \big( 1 - c^\top B^{-1} c \big)^{-1} x(1)^2 \,. \]

Proof. By the formula for block matrix inverses,
\[ A^{-1} = \begin{pmatrix} (1 - c^\top B^{-1} c)^{-1} & -c^\top (B - cc^\top)^{-1} \\ -(B - cc^\top)^{-1} c & (B - cc^\top)^{-1} \end{pmatrix} \,. \]
The result follows by Sherman–Morrison applied to $(B - cc^\top)^{-1}$ and the definition of $x$.

Lemma B.4. Let $v \in \mathbb{R}^n$ be a unit vector and let $b_1, \dots, b_n \in \mathbb{R}^{d-1}$ have iid entries from $\mathcal{N}(0, 1/n)$. Let $a_i \in \mathbb{R}^d$ be given by $a_i := (v(i), b_i)$. Let $A := \sum_i a_i a_i^\top$, and let $c \in \mathbb{R}^{d-1}$ be given by $c := \sum_i v(i)\, b_i$. Then for every index $i \in [n]$, w.ov.p.,
\[ \big| \langle a_i, A^{-1} a_i \rangle - \|a_i\|^2 \big| \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \cdot \|a_i\|^2 \,. \]

Proof. Let $B := \sum_i b_i b_i^\top$.
By standard concentration, $\| B^{-1} - \mathrm{Id} \| \le \tilde O(d/n)^{1/2}$ w.ov.p. [Ver10, Corollary 5.50]. At the same time, since $v$ has unit norm, the entries of $c$ are iid samples from $\mathcal{N}(0, 1/n)$, and hence $n \|c\|^2$ is $\chi^2$-distributed with $d-1$ degrees of freedom. Thus w.ov.p. $\|c\|^2 \le \frac dn + \tilde O(dn)^{-1/2}$. Together these imply the following useful estimates, all of which hold w.ov.p.:
\[ | c^\top B^{-1} c | \le \|c\|^2 \| B^{-1} \|_{op} \le \frac dn + \tilde O\Big( \frac dn \Big)^{3/2} \,, \]
\[ \| B^{-1} c c^\top B^{-1} \|_{op} \le \|c\|^2 \| B^{-1} \|_{op}^2 \le \frac dn + \tilde O\Big( \frac dn \Big)^{3/2} \,, \]
\[ \Big\| \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big\|_{op} \le \frac dn + \tilde O\Big( \frac dn \Big)^{3/2} \,, \]
where the first two use Cauchy–Schwarz and the last follows from the first two. We turn now to the expansion of $\langle a_i, A^{-1} a_i \rangle$ offered by Lemma B.3:
\[ \langle a_i, A^{-1} a_i \rangle = \Big\langle b_i, \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) b_i \Big\rangle \tag{B.1} \]
\[ \quad - 2 v(i) \Big\langle \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) c, b_i \Big\rangle \tag{B.2} \]
\[ \quad + \big( 1 - c^\top B^{-1} c \big)^{-1} v(i)^2 \,. \tag{B.3} \]
Addressing (B.1) first, by the above estimates and Lemma B.2 applied to $\langle b_i, B^{-1} b_i \rangle$,
\[ \Big| \Big\langle b_i, \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) b_i \Big\rangle - \|b_i\|^2 \Big| \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \cdot \|b_i\|^2 \quad \text{w.ov.p.} \]
For (B.2), we pull out the important factor of $\|c\|$ and separate $v(i)$ from $b_i$: w.ov.p.,
\[ \Big| 2 v(i) \Big\langle \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) c, b_i \Big\rangle \Big| = 2 \|c\| \cdot \Big| v(i) \Big\langle \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) \frac{c}{\|c\|}, b_i \Big\rangle \Big| \]
\[ \le \|c\|^2 \bigg( v(i)^2 + \Big\langle \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) \frac{c}{\|c\|}, b_i \Big\rangle^2 \bigg) \le \tilde O\Big( \frac dn \Big) \big( v(i)^2 + \|b_i\|^2 \big) = \tilde O\Big( \frac dn \Big) \|a_i\|^2 \,, \]
where the last inequality follows from our estimates above and Cauchy–Schwarz. Finally, for (B.3), since $1 - c^\top B^{-1} c \ge 1 - \tilde O(d/n)$ w.ov.p., we have that
\[ \big| (1 - c^\top B^{-1} c)^{-1} v(i)^2 - v(i)^2 \big| \le \tilde O\Big( \frac dn \Big) v(i)^2 \,. \]
Putting it all together,
\[ \big| \langle a_i, A^{-1} a_i \rangle - \|a_i\|^2 \big| \le \Big| \Big\langle b_i, \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) b_i \Big\rangle - \|b_i\|^2 \Big| + \Big| 2 v(i) \Big\langle \Big( B^{-1} + \frac{B^{-1} c c^\top B^{-1}}{1 - c^\top B^{-1} c} \Big) c, b_i \Big\rangle \Big| \]
\[ \quad + \big| (1 - c^\top B^{-1} c)^{-1} v(i)^2 - v(i)^2 \big| \le \tilde O\Big( \frac{d + \sqrt n}{n} \Big) \cdot \|a_i\|^2 \,. \]

C Concentration bounds for overcomplete tensor decomposition

We require some facts about the concentration of certain scalar- and matrix-valued random variables, which generally follow from standard concentration arguments. We present proofs here for completeness. The first lemma captures standard facts about random Gaussians.

Fact C.1. Let $a_1, \dots, a_n \in \mathbb{R}^d$ be sampled $a_i \sim \mathcal{N}(0, \frac1d \mathrm{Id})$.

1. Inner products are all $|\langle a_i, a_j \rangle| \approx 1/\sqrt d$:
\[ \mathbb{P}\Big( \langle a_i, a_j \rangle^2 \le \tilde O\Big( \frac1d \Big) \;\; \forall\, i, j \in [n],\, i \ne j \Big) \ge 1 - n^{-\omega(1)} \,. \]

2. Norms are all about $\|a_i\| \approx 1 \pm \tilde O(1/\sqrt d)$:
\[ \mathbb{P}\Big( 1 - \tilde O(1/\sqrt d) \le \|a_i\|_2^2 \le 1 + \tilde O(1/\sqrt d) \;\; \forall\, i \in [n] \Big) \ge 1 - n^{-\omega(1)} \,. \]

3. Fix a vector $v \in \mathbb{R}^d$. Suppose $g \in \mathbb{R}^d$ is a vector with entries identically distributed $g_i \sim \mathcal{N}(0, \sigma^2)$. Then $\langle g, v \rangle^2 \approx \sigma^2 \cdot \|v\|_2^2$:
\[ \mathbb{P}\Big( \big| \langle g, v \rangle^2 - \sigma^2 \cdot \|v\|_2^2 \big| \le \tilde O\big( \sigma^2 \cdot \|v\|_2^2 \big) \Big) \ge 1 - n^{-\omega(1)} \,. \]

Proof of Fact C.1. We start with Item 1. Consider the quantity $\langle a_i, a_j \rangle^2$. We calculate the expectation,
\[ \mathbb{E}\big[ \langle a_i, a_j \rangle^2 \big] = \sum_{k, \ell \in [d]} \mathbb{E}\big[ a_i(k) a_i(\ell) a_j(k) a_j(\ell) \big] = \sum_{k \in [d]} \mathbb{E}\big[ a_i(k)^2 \big] \cdot \mathbb{E}\big[ a_j(k)^2 \big] = d \cdot \frac{1}{d^2} = \frac1d \,. \]
Since this is a degree-4 square polynomial in the entries of $a_i$ and $a_j$, we may apply Lemma A.5 to conclude that
\[ \mathbb{P}\Big( \langle a_i, a_j \rangle^2 \ge t \cdot \frac1d \Big) \le \exp\big( -\Omega(t^{1/2}) \big) \,. \]
Applying this fact with $t = \mathrm{polylog}(n)$ and taking a union bound over pairs $i, j \in [n]$ gives us the desired result.

Next is Item 2. Consider the quantity $\|a_i\|_2^2$. We will apply Lemma A.5 in order to obtain a tail bound for the value of the polynomial $(\|a_i\|_2^2 - 1)^2$. We have
\[ \mathbb{E}\big[ (\|a_i\|_2^2 - 1)^2 \big] = O\Big( \frac1d \Big) \,, \]
and now applying Lemma A.5 with the square root of this expectation, we have
\[ \mathbb{P}\Big( \big| \|a_i\|_2^2 - 1 \big| \ge \tilde O\big( \tfrac{1}{\sqrt d} \big) \Big) \le n^{-\log n} \,. \]
This gives both bounds for a single $a_i$. The result now follows from taking a union bound over all $i$.

Moving on to Item 3, we view the expression $f(g) := (\langle g, v \rangle^2 - \sigma^2 \|v\|^2)^2$ as a polynomial in the Gaussian entries of $g$. The degree of $f(g)$ is 4, and $\mathbb{E}[|f(g)|] = 2 \sigma^4 \cdot \|v\|_2^4$, and so we may apply Lemma A.5 to conclude that
\[ \mathbb{P}\big( |f(g)| \ge t \cdot 2 \sigma^4 \cdot \|v\|_2^4 \big) \le \exp\big( -c_4 t^{1/2} \big) \,, \]
and taking $t = \mathrm{polylog}(n)$ the conclusion follows.

We also use the fact that the covariance matrix of a sum of sufficiently many Gaussian outer products concentrates about its expectation.

Fact C.2. Let $a_1, \dots, a_n \in \mathbb{R}^d$ be vectors with iid Gaussian entries such that $\mathbb{E}[\|a_i\|_2^2] = 1$, and $n = \Omega(d)$. Then the sum $\sum_{i \in [n]} a_i a_i^\top$ is close to $\frac nd \cdot \mathrm{Id}$, in the sense that
\[ \mathbb{P}\Big( \tilde\Omega(n/d) \cdot \mathrm{Id} \preceq \sum_{i \in [n]} a_i a_i^\top \preceq \tilde O(n/d) \cdot \mathrm{Id} \Big) \ge 1 - n^{-\omega(1)} \,. \]

Proof of Fact C.2. We apply a truncated matrix Bernstein inequality. For convenience, let $A := \sum_{i \in [n]} a_i a_i^\top$ and let $A_i := a_i a_i^\top$ be a single summand. To begin, we calculate the first and second moments of the summands,
\[ \mathbb{E}[A_i] = \frac1d \cdot \mathrm{Id} \,, \qquad \mathbb{E}\big[ A_i A_i^\top \big] = O\Big( \frac1d \Big) \cdot \mathrm{Id} \,. \]
So we have $\mathbb{E}[A] = \frac nd \cdot \mathrm{Id}$ and $\sigma^2(A) = O(\frac nd)$. We now show that each summand is well-approximated by a truncated variable. To calculate the expected norm $\|A_i\|_{op}$, we observe that $A_i$ is rank-1 and thus $\mathbb{E}[\|A_i\|_{op}] = \mathbb{E}[\|a_i\|_2^2] = 1$. Applying Lemma A.8, we have
\[ \mathbb{P}\big( \|A_i\|_{op} \ge \tilde O(1) \big) \le n^{-\log n} \,, \quad\text{and also}\quad \mathbb{E}\big[ \big\| A_i \cdot \mathbb{I}\{ \|A_i\|_{op} \ge \tilde O(1) \} \big\|_{op} \big] \le n^{-\log n} \,. \]
Thus, applying the truncated matrix Bernstein inequality from Proposition A.7 with $\sigma^2 = O(\frac nd)$, $\beta = \tilde O(1)$, $p = n^{-\log n}$, $q = n^{-\log n}$, and $t = \tilde O(n^{1/2}/d^{1/2})$, we have that with overwhelming probability,
\[ \Big\| A - \frac nd \cdot \mathrm{Id} \Big\|_{op} \le \tilde O\Big( \frac{n^{1/2}}{d^{1/2}} \Big) \,. \]
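The following numerical sketch (our addition, not from the paper) illustrates Fact C.2: for $n \gg d$, the spectrum of $\sum_i a_i a_i^\top$ concentrates around $n/d$. The dimensions and tolerances are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 50, 5000
# Entries N(0, 1/d) so that E ||a_i||^2 = 1, as in Fact C.2.
A = rng.normal(scale=1 / np.sqrt(d), size=(n, d))
# Eigenvalues of sum_i a_i a_i^T = A^T A.
eigs = np.linalg.eigvalsh(A.T @ A)
ratio_lo = eigs.min() / (n / d)
ratio_hi = eigs.max() / (n / d)
# For n >> d the whole spectrum is (1 +- O(sqrt(d/n))) * n/d.
assert 0.7 < ratio_lo <= ratio_hi < 1.3
```

Here $\sqrt{d/n} = 0.1$, so the observed extreme eigenvalue ratios land near the Marchenko–Pastur edges $(1 \pm 0.1)^2$, comfortably inside the loose $[0.7, 1.3]$ window used in the assertion.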
We now show that among the terms of the polynomial $\langle g, T(a_j \otimes a_j) \rangle$, those that depend on $a_i$ with $i \ne j$ have small magnitude. This polynomial appears in the proof that $M_{\mathrm{diag}}$ has a noticeable spectral gap.

Lemma (Restatement of Lemma 5.6). Let $a_1, \dots, a_n$ be independently sampled vectors from $\mathcal{N}(0, \frac1d \mathrm{Id}_d)$, and let $g$ be sampled from $\mathcal{N}(0, \mathrm{Id}_d)$. Let $T = \sum_i a_i (a_i \otimes a_i)^\top$. Then with overwhelming probability, for every $j \in [n]$,
\[ \big| \langle g, T(a_j \otimes a_j) \rangle - \langle g, a_j \rangle \|a_j\|^4 \big| \le \tilde O\Big( \frac{\sqrt n}{d} \Big) \,. \]

Proof. Fixing $a_j$ and $g$, the terms in the summation are independent, and we may apply a Bernstein inequality. A straightforward calculation shows that the expectation of the sum is 0 and the variance is $\tilde O(\frac{n}{d^2}) \cdot \|g\|^2 \|a_j\|^4$. Additionally, each summand is a polynomial in Gaussian variables, the square of which has expectation $\tilde O(\frac{1}{d^2} \cdot \|g\|^2 \|a_j\|^4)$. Thus Lemma A.5 allows us to truncate each summand appropriately so as to employ Proposition A.7. An appropriate choice of logarithmic factors and the concentration of $\|g\|^2$ and $\|a_j\|^2$ due to Fact C.1 gives the result for each $j \in [n]$. A union bound over each choice of $j$ gives the final result.

Finally, we prove that a matrix which appears in the expression for $M_{\mathrm{same}}$ has bounded norm w.ov.p.

Lemma C.3. Let $a_1, \dots, a_n$ be independent from $\mathcal{N}(0, \frac1d \mathrm{Id}_d)$, let $g \sim \mathcal{N}(0, \mathrm{Id}_d)$, and fix $j \in [n]$. Then w.ov.p.,
\[ \Big\| \sum_{i \in [n],\, i \ne j} \langle g, a_i \rangle \|a_i\|^2 \langle a_i, a_j \rangle \cdot a_i a_i^\top \Big\| \le \tilde O(n/d^2)^{1/2} \,. \]

Proof. The proof proceeds by truncated matrix Bernstein, since the summands are independent for fixed $g, a_j$. For this we need to compute the variance:
\[ \sigma^2 = \Big\| \sum_{i \in [n],\, i \ne j} \mathbb{E}\big[ \langle g, a_i \rangle^2 \|a_i\|^6 \langle a_i, a_j \rangle^2 \cdot a_i a_i^\top \big] \Big\| \le O(1/d) \cdot \Big\| \sum_{i \in [n],\, i \ne j} \mathbb{E}\, a_i a_i^\top \Big\| \le O(1/d) \cdot n/d \le O(n/d^2) \,. \]
The norm of each term in the sum is bounded by a constant-degree polynomial of Gaussians.
Straightforward calculations show that in expectation each term is $O(\frac1d \langle g, a_i \rangle)$ in norm; w.ov.p. this is $O(\sigma)$. So Lemma A.5 applies to establish the hypotheses of the truncated Bernstein inequality, Proposition A.7. In turn, Proposition A.7 yields that w.ov.p.,
\[ \Big\| \sum_{i \in [n],\, i \ne j} \langle g, a_i \rangle \|a_i\|^2 \langle a_i, a_j \rangle \cdot a_i a_i^\top \Big\| \le \tilde O(\sigma) = \tilde O(n/d^2)^{1/2} \,. \]

C.0.3 Proof of Fact C.4

Here we prove the following fact.

Fact C.4. Let $\Sigma = \mathbb{E}_{x \sim \mathcal{N}(0, \mathrm{Id}_d)} (xx^\top)^{\otimes 2}$ and let $\tilde\Sigma = \mathbb{E}_{x \sim \mathcal{N}(0, \mathrm{Id}_d)} (xx^\top)^{\otimes 2} / \|x\|^4$. Let $\Phi = \sum_i e_i^{\otimes 2} \in \mathbb{R}^{d^2}$, and let $\Pi_{\mathrm{sym}}$ be the projector to the symmetric subspace of $\mathbb{R}^{d^2}$ (the span of vectors of the form $x^{\otimes 2}$ for $x \in \mathbb{R}^d$). Then
\[ \Sigma = 2 \Pi_{\mathrm{sym}} + \Phi \Phi^\top \,, \qquad \tilde\Sigma = \frac{2}{d^2 + 2d} \Pi_{\mathrm{sym}} + \frac{1}{d^2 + 2d} \Phi \Phi^\top \,, \]
\[ \Sigma^+ = \frac12 \Pi_{\mathrm{sym}} - \frac{1}{2(d+2)} \Phi \Phi^\top \,, \qquad \tilde\Sigma^+ = \frac{d^2 + 2d}{2} \Pi_{\mathrm{sym}} - \frac d2 \Phi \Phi^\top \,. \]
In particular,
\[ R = \sqrt2\, (\Sigma^+)^{1/2} = \Pi_{\mathrm{sym}} - \frac1d \bigg( 1 - \sqrt{\frac{2}{d+2}} \bigg) \Phi \Phi^\top \]
has $\|R\| = 1$, and for any $v \in \mathbb{R}^d$,
\[ \| R(v \otimes v) \|_2^2 = \Big( 1 - \frac{1}{d+2} \Big) \cdot \|v\|^4 \,. \]

We will derive Fact C.4 as a corollary of a more general claim about rotationally symmetric distributions.

Lemma C.5. Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d$ which is rotationally symmetric; that is, for any rotation $R$, $x \sim \mathcal{D}$ is distributed identically to $Rx$. Let $\Sigma = \mathbb{E}_{x \sim \mathcal{D}} (xx^\top)^{\otimes 2}$, let $\Phi = \sum_i e_i^{\otimes 2} \in \mathbb{R}^{d^2}$, and let $\Pi_{\mathrm{sym}}$ be the projector to the symmetric subspace of $\mathbb{R}^{d^2}$ (the span of vectors of the form $x^{\otimes 2}$ for $x \in \mathbb{R}^d$). Then there is a constant $r$ so that
\[ \Sigma = 2r\, \Pi_{\mathrm{sym}} + r\, \Phi \Phi^\top \,. \]
Furthermore, $r$ is given by $r = \mathbb{E} \langle x, a \rangle^2 \langle x, b \rangle^2 = \frac13 \mathbb{E} \langle x, a \rangle^4$, where $a, b$ are orthogonal unit vectors.

Proof. First, $\Sigma$ is symmetric and operates nontrivially only on the symmetric subspace (in other words, $\ker \Pi_{\mathrm{sym}} \subseteq \ker \Sigma$). This follows from $\Sigma$ being an expectation over symmetric matrices whose kernels always contain the complement of the symmetric subspace. Let $\hat a, \hat b, \hat c, \hat d \in \mathbb{R}^d$ be any four orthogonal unit vectors.
Let $R$ be any rotation of $\mathbb{R}^d$ that takes $\hat a$ to $-\hat a$ but fixes $\hat b$, $\hat c$, and $\hat d$ (this rotation exists for $d \ge 5$, but a different argument holds for $d \le 4$). By rotational symmetry about $R$, all of these quantities are 0:
\[ \mathbb{E}\, \langle \hat a, x \rangle \langle \hat b, x \rangle \langle \hat c, x \rangle \langle \hat d, x \rangle = 0 \,, \qquad \mathbb{E}\, \langle \hat a, x \rangle \langle \hat b, x \rangle \langle \hat c, x \rangle^2 = 0 \,, \qquad \mathbb{E}\, \langle \hat a, x \rangle \langle \hat b, x \rangle^3 = 0 \,. \]
Furthermore, let $Q$ be a rotation of $\mathbb{R}^d$ that takes $\hat a$ to $(\hat a + \hat b)/\sqrt2$. Then by rotational symmetry about $Q$,
\[ \mathbb{E}\, \langle \hat a, x \rangle^4 = \mathbb{E}\, \langle \hat a, Qx \rangle^4 = \tfrac14\, \mathbb{E}\, \langle \hat a + \hat b, x \rangle^4 = \tfrac14\, \mathbb{E}\big[ \langle \hat a, x \rangle^4 + \langle \hat b, x \rangle^4 + 6 \langle \hat a, x \rangle^2 \langle \hat b, x \rangle^2 \big] \,, \]
where the odd terms vanish by the previous display. Thus, since $\mathbb{E} \langle \hat a, x \rangle^4 = \mathbb{E} \langle \hat b, x \rangle^4$ by rotational symmetry, we have $\mathbb{E} \langle \hat a, x \rangle^4 = 3\, \mathbb{E} \langle \hat a, x \rangle^2 \langle \hat b, x \rangle^2$. So let $r := \mathbb{E} \langle \hat a, x \rangle^2 \langle \hat b, x \rangle^2 = \frac13 \mathbb{E} \langle \hat a, x \rangle^4$. By rotational symmetry, $r$ is constant over the choice of orthogonal unit vectors $\hat a$ and $\hat b$.

Since $\Sigma$ operates only on the symmetric subspace, let $u \in \mathbb{R}^{d^2}$ be any unit vector in the symmetric subspace. Such a $u$ unfolds to a symmetric matrix in $\mathbb{R}^{d \times d}$, so that it has an eigendecomposition $u = \sum_{i=1}^d \lambda_i\, u_i \otimes u_i$. Evaluating $\langle u, \Sigma u \rangle$,
\[ \langle u, \Sigma u \rangle = \sum_{i,j=1}^d \mathbb{E}\, \lambda_i \lambda_j \langle x, u_i \rangle^2 \langle x, u_j \rangle^2 \qquad \text{(other terms are 0 by the above)} \]
\[ = 3r \sum_{i=1}^d \lambda_i^2 + r \sum_{i \ne j} \lambda_i \lambda_j = 2r \sum_{i=1}^d \lambda_i^2 + r \Big( \sum_{i=1}^d \lambda_i \Big)^2 = 2r \|u\|^2 + r \Big( \sum_{i=1}^d \lambda_i \Big)^2 \qquad \text{(the Frobenius norm is the sum of squared eigenvalues)} \]
\[ = 2r \|u\|^2 + r \Big( \sum_i u_{i,i} \Big)^2 \qquad \text{(the trace is the sum of eigenvalues)} \]
\[ = 2r \langle u, \Pi_{\mathrm{sym}} u \rangle + r \langle u, \Phi \Phi^\top u \rangle \,, \]
so therefore $\Sigma = 2r\, \Pi_{\mathrm{sym}} + r\, \Phi \Phi^\top$.

Proof of Fact C.4. When $x \sim \mathcal{N}(0, \mathrm{Id}_d)$, the expectation $\mathbb{E} \langle x, a \rangle^2 \langle x, b \rangle^2 = 1$ is just a product of independent standard Gaussian second moments. Therefore by Lemma C.5, $\Sigma = 2 \Pi_{\mathrm{sym}} + \Phi \Phi^\top$.
To find $\tilde\Sigma$, observe that if $x \sim \mathcal{N}(0, \mathrm{Id}_d)$ then $x/\|x\|$ is uniformly distributed on the unit sphere, so we may take $x$ uniform on the sphere and compute
\[ 1 = \mathbb{E}\, \|x\|^4 = \sum_{i,j} \mathbb{E}\, x_i^2 x_j^2 = d\, \mathbb{E}\, x_1^4 + (d^2 - d)\, \mathbb{E}\, x_1^2 x_2^2 \,, \]
and use the fact that $\mathbb{E}\, x_1^4 = 3\, \mathbb{E}\, x_1^2 x_2^2$ (by Lemma C.5) to find that $\mathbb{E}\, x_1^2 x_2^2 = \frac{1}{d^2 + 2d}$, and therefore by Lemma C.5,
\[ \tilde\Sigma = \frac{2}{d^2 + 2d} \Pi_{\mathrm{sym}} + \frac{1}{d^2 + 2d} \Phi \Phi^\top \,. \]
To verify the pseudoinverses, it is enough to check that $M M^+ = \Pi_{\mathrm{sym}}$ for each matrix $M$ and its claimed pseudoinverse $M^+$. To show that $\| R(v \otimes v) \|_2^2 = (1 - \frac{1}{d+2}) \cdot \|v\|^4$ for any $v \in \mathbb{R}^d$, we write $\| R(v \otimes v) \|_2^2 = (v \otimes v)^\top R^2 (v \otimes v)$ and use the substitution $R^2 = 2 \Sigma^+$, along with the facts that $\Pi_{\mathrm{sym}}(v \otimes v) = v \otimes v$ and $\langle \Phi, v \otimes v \rangle = \|v\|^2$.

Now we can prove some concentration claims we deferred.

Lemma (Restatement of Lemma 5.11). Let $a_1, \dots, a_n \sim \mathcal{N}(0, \frac1d \mathrm{Id}_d)$. Let $\Sigma, R$ be as in Fact 5.8. Let $u_i = a_i \otimes a_i$. With overwhelming probability, every $j \in [n]$ satisfies $\sum_{i \ne j} \langle u_j, R^2 u_i \rangle^2 = \tilde O(n/d^2)$ and $| 1 - \| R u_j \|^2 | \le \tilde O(1/\sqrt d)$.

Proof of Lemma 5.11. We prove the first item:
\[ \sum_{i \ne j} \langle u_j, R^2 u_i \rangle^2 = \sum_{i \ne j} \langle u_j, 2 \Sigma^+ u_i \rangle^2 = \sum_{i \ne j} \Big\langle u_j, \Big( \Pi_{\mathrm{sym}} - \frac{1}{d+2} \Phi \Phi^\top \Big) u_i \Big\rangle^2 \qquad \text{(by Fact C.4)} \]
\[ = \sum_{i \ne j} \Big( \langle a_j, a_i \rangle^2 - \frac{1}{d+2} \|a_j\|^2 \|a_i\|^2 \Big)^2 = \sum_{i \ne j} \tilde O(1/d)^2 \quad \text{w.ov.p., by Fact C.1} \]
\[ = \tilde O(n/d^2) \,. \]
And one direction of the second item, using Fact C.4 and Fact C.1 (the other direction is similar):
\[ \| R u_j \|^2 = \langle u_j, R^2 u_j \rangle = \Big\langle u_j, \Big( \Pi_{\mathrm{sym}} - \frac{1}{d+2} \Phi \Phi^\top \Big) u_j \Big\rangle = \big( 1 - \Theta(1/d) \big) \|a_j\|^4 = 1 - \tilde O(1/\sqrt d) \,, \]
where the last equality holds w.ov.p.

C.0.4 Proof of Lemma 5.9

To prove Lemma 5.9 we will begin by reducing to the case $S = [n]$ via the following.

Lemma C.6. Let $v_1, \dots, v_n \in \mathbb{R}^d$. Let $A_S$ have columns $\{v_i\}_{i \in S}$, and let $\Pi_S$ be the projector to $\mathrm{Span}\{v_i\}_{i \in S}$. Suppose there is $c > 0$ so that $\| A_{[n]}^\top A_{[n]} - \mathrm{Id}_n \| \le c$.
Then for every $S \subseteq [n]$, $\| A_S A_S^\top - \Pi_S \| \le c$.

Proof. If the hypothesized bound $\| A_{[n]}^\top A_{[n]} - \mathrm{Id}_n \| \le c$ holds, then for every $S \subseteq [n]$ we get $\| A_S^\top A_S - \mathrm{Id}_{|S|} \| \le c$, since $A_S^\top A_S$ is a principal submatrix of $A_{[n]}^\top A_{[n]}$. If $\| A_S^\top A_S - \mathrm{Id}_{|S|} \| \le c$, then because $A_S A_S^\top$ has the same nonzero eigenvalues as $A_S^\top A_S$, we must also have $\| A_S A_S^\top - \Pi_S \| \le c$.

It will be convenient to reduce concentration for matrices involving $a_i \otimes a_i$ to analogous matrices where the vectors $a_i \otimes a_i$ are replaced by isotropic vectors of constant norm. The following lemma shows how to do this.

Lemma C.7. Let $a \sim \mathcal{N}(0, \frac1d \mathrm{Id}_d)$, and let $\tilde\Sigma := \mathbb{E}_{x \sim \mathcal{N}(0, \mathrm{Id}_d)} (xx^\top)^{\otimes 2} / \|x\|^4$. Then $u := (\tilde\Sigma^+)^{1/2}\, a \otimes a / \|a\|^2$ is an isotropic random vector in the symmetric subspace $\mathrm{Span}\{ y \otimes y \mid y \in \mathbb{R}^d \}$, with $\|u\| = \sqrt{\dim \mathrm{Span}\{ y \otimes y \mid y \in \mathbb{R}^d \}}$.

Proof. The vector $u$ is isotropic by definition, so we prove the norm claim. By Fact C.4,
\[ \tilde\Sigma^+ = \frac{d^2 + 2d}{2} \Pi_{\mathrm{sym}} - \frac d2 \Phi \Phi^\top \,. \]
Thus,
\[ \|u\|^2 = \Big\langle \frac{a \otimes a}{\|a\|^2}, \tilde\Sigma^+ \frac{a \otimes a}{\|a\|^2} \Big\rangle = \frac{d^2 + 2d}{2} - \frac d2 = \frac{d^2 + d}{2} = \dim \mathrm{Span}\{ y \otimes y \mid y \in \mathbb{R}^d \} \,. \]

The last ingredient to finish the spectral bound is a bound on the incoherence of independent samples of the form $(\tilde\Sigma^+)^{1/2} (a_i \otimes a_i) / \|a_i\|^2$.

Lemma C.8. Let $\tilde\Sigma = \mathbb{E}_{a \sim \mathcal{N}(0, \mathrm{Id}_d)} (aa^\top \otimes aa^\top) / \|a\|^4$. Let $a_1, \dots, a_n \sim \mathcal{N}(0, \mathrm{Id}_d)$ be independent, and let $u_i = (\tilde\Sigma^+)^{1/2} (a_i \otimes a_i) / \|a_i\|^2$. Let $d' = \dim \mathrm{Span}\{ y \otimes y \mid y \in \mathbb{R}^d \} = \frac12 (d^2 + d)$. Then
\[ \frac{1}{d'}\, \mathbb{E} \max_i \sum_{j \ne i} \langle u_i, u_j \rangle^2 \le \tilde O(n) \,. \]

Proof. Expanding $\langle u_i, u_j \rangle^2$ and using $\tilde\Sigma^+ = \frac{d^2 + 2d}{2} \Pi_{\mathrm{sym}} - \frac d2 \Phi \Phi^\top$, we get
\[ \langle u_i, u_j \rangle^2 = \bigg( \frac{d^2 + 2d}{2} \Big\langle \frac{a_i \otimes a_i}{\|a_i\|^2}, \frac{a_j \otimes a_j}{\|a_j\|^2} \Big\rangle - \frac d2 \bigg)^2 = \bigg( \frac{d^2 + 2d}{2} \cdot \frac{\langle a_i, a_j \rangle^2}{\|a_i\|^2 \|a_j\|^2} - \frac d2 \bigg)^2 \,. \]
From elementary concentration, $\mathbb{E} \max_{i \ne j} \langle a_i, a_j \rangle^2 / \|a_i\|^2 \|a_j\|^2 \le \tilde O(1/d)$, so the lemma follows by elementary manipulations.
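To make the $\Pi_{\mathrm{sym}}/\Phi\Phi^\top$ calculus above concrete, here is a small numerical verification (our addition, not from the paper) of the pseudoinverse formula from Fact C.4 and of the norm claim of Lemma C.7. It builds $\Pi_{\mathrm{sym}}$ from the swap operator; $d = 6$ is an arbitrary choice.

```python
import numpy as np

d = 6
I = np.eye(d * d)
# Swap operator P (e_i ⊗ e_j -> e_j ⊗ e_i); Pi_sym = (I + P)/2 projects onto
# the symmetric subspace Span{y ⊗ y}.
P = np.zeros((d * d, d * d))
for i in range(d):
    for j in range(d):
        P[i * d + j, j * d + i] = 1.0
Pi_sym = (I + P) / 2
Phi = np.eye(d).reshape(-1)  # Phi = sum_i e_i ⊗ e_i
# tilde-Sigma and its claimed pseudoinverse, per Fact C.4.
Sigma_t = (2 * Pi_sym + np.outer(Phi, Phi)) / (d * d + 2 * d)
Sigma_t_pinv = (d * d + 2 * d) / 2 * Pi_sym - d / 2 * np.outer(Phi, Phi)
assert np.allclose(Sigma_t @ Sigma_t_pinv, Pi_sym)  # M M^+ = Pi_sym
# Lemma C.7 norm claim: <u, Sigma_t^+ u> = d(d+1)/2 for u = (a ⊗ a)/||a||^2.
a = np.random.default_rng(3).normal(size=d)
u = np.outer(a, a).reshape(-1) / (a @ a)
assert abs(u @ Sigma_t_pinv @ u - d * (d + 1) / 2) < 1e-8
```

The second assertion holds for every nonzero $a$, not just random ones, since $\Pi_{\mathrm{sym}} u = u$ and $\langle \Phi, u \rangle = 1$ exactly.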
We need the following bound on the deviation from expectation of a tall matrix with independent columns.

Theorem C.9 (Theorem 5.62 in [Ver10]). Let $A$ be an $N \times n$ matrix ($N \ge n$) whose columns $A_j$ are independent isotropic random vectors in $\mathbb{R}^N$ with $\|A_j\|_2 = \sqrt N$ almost surely. Consider the incoherence parameter
\[ m \stackrel{\mathrm{def}}{=} \frac1N\, \mathbb{E} \max_{i \in [n]} \sum_{j \ne i} \langle A_i, A_j \rangle^2 \,. \]
Then $\mathbb{E}\, \big\| \frac1N A^\top A - \mathrm{Id} \big\| \le C_0 \sqrt{\frac{m \log n}{N}}$.

We are now prepared to handle the case of $S = [n]$ via spectral concentration for matrices with independent columns, Theorem C.9.

Lemma (Restatement of Lemma 5.9). Let $a_1, \dots, a_n \sim \mathcal{N}(0, \frac1d \mathrm{Id}_d)$ be independent random vectors with $d \le n$. Let $R := \sqrt2 \cdot ((\mathbb{E}(aa^\top)^{\otimes 2})^+)^{1/2}$ for $a \sim \mathcal{N}(0, \mathrm{Id}_d)$. For $S \subseteq [n]$, let $P_S = \sum_{i \in S} (a_i a_i^\top)^{\otimes 2}$ and let $\Pi_S$ be the projector onto the subspace spanned by $\{ R a_i^{\otimes 2} \mid i \in S \}$. Then, with probability $1 - o(1)$ over the choice of $a_1, \dots, a_n$,
\[ \forall S \subseteq [n] \,.\quad \big( 1 - \tilde O(n/d^{3/2}) \big) \cdot \Pi_S \preceq R P_S R \preceq \big( 1 + \tilde O(n/d^{3/2}) \big) \cdot \Pi_S \,. \]

Proof of Lemma 5.9. By Lemma C.6 it is enough to prove the lemma in the case of $S = [n]$. For this we will use Theorem C.9. Let $A$ be the matrix whose columns are given by $a_i \otimes a_i$, so that $P_{[n]} = P = AA^\top$. Because $R AA^\top R$ and $A^\top R^2 A$ have the same nonzero eigenvalues, it will be enough to show that $\| A^\top R^2 A - \mathrm{Id} \| \le \tilde O(\sqrt n/d) + \tilde O(n/d^{3/2})$ with probability $1 - o(1)$. (Since $d \le n$ we have $\sqrt n/d = \tilde O(n/d^{3/2})$, so this gives the theorem.)

The columns of $RA$ are independent, given by $R(a_i \otimes a_i)$. However, they do not quite satisfy the normalization conditions needed for Theorem C.9. Let $D$ be the $n \times n$ diagonal matrix whose $i$th diagonal entry is $\|a_i\|^2$, and let $\tilde\Sigma = \mathbb{E}_{x \sim \mathcal{N}(0, \mathrm{Id})} (xx^\top)^{\otimes 2} / \|x\|^4$. Then by Lemma C.7, the matrix $(\tilde\Sigma^+)^{1/2} A D^{-1}$ has independent columns drawn from an isotropic distribution with fixed norm $\sqrt{d'}$.
Together with Lemma C.8, this is enough to apply Theorem C.9 and conclude that $\mathbb E\,\|\tfrac{1}{d'}D^{-1}A^\top\tilde\Sigma^+AD^{-1} - \mathrm{Id}\| \le \tilde O(\sqrt n/d)$. By Markov's inequality, $\|\tfrac{1}{d'}D^{-1}A^\top\tilde\Sigma^+AD^{-1} - \mathrm{Id}\| \le \tilde O(\sqrt n/d)$ with probability $1-o(1)$.

We will show next that $\|A^\top R^2A - \tfrac{1}{d'}D^{-1}A^\top\tilde\Sigma^+AD^{-1}\| \le \tilde O(n/d^{3/2})$ with probability $1-o(1)$; the lemma then follows by the triangle inequality. Writing $B := AD^{-1}$, so that $A = BD$, the expression inside the norm expands as
$$\bigl(DB^\top R^2BD - B^\top R^2B\bigr) + B^\top\bigl(R^2 - \tfrac{1}{d'}\tilde\Sigma^+\bigr)B\,,$$
and so
$$\bigl\|A^\top R^2A - \tfrac{1}{d'}D^{-1}A^\top\tilde\Sigma^+AD^{-1}\bigr\| \le \bigl\|DB^\top R^2BD - B^\top R^2B\bigr\| + \|B\|^2\,\bigl\|R^2 - \tfrac{1}{d'}\tilde\Sigma^+\bigr\|\,.$$
By Fact C.1, with overwhelming probability $\|D - \mathrm{Id}\| \le \tilde O(1/\sqrt d)$, so $\|DB^\top R^2BD - B^\top R^2B\| \le \tilde O(1/\sqrt d)\cdot\|B\|^2\|R^2\|$ w.ov.p. We recall from Fact C.4, given that $R = \sqrt2\cdot(\Sigma^+)^{1/2}$, that
$$R^2 = \Pi_{\mathrm{sym}} - \tfrac{1}{d+2}\Phi\Phi^\top \qquad\text{and}\qquad \tfrac{1}{d'}\tilde\Sigma^+ = \tfrac{d+2}{d+1}\Pi_{\mathrm{sym}} - \tfrac{1}{d+1}\Phi\Phi^\top\,.$$
This implies that $\|R^2\| \le 1$ and $\|R^2 - \tfrac{1}{d'}\tilde\Sigma^+\| \le O(1/d)$. Finally, by an easy application of Proposition A.7, $\|B\|^2 \le O(\|A\|^2) = O\bigl(\bigl\|\sum_i(a_ia_i^\top)^{\otimes2}\bigr\|\bigr) \le \tilde O(n/d)$ w.ov.p. All together, $\|A^\top R^2A - \tfrac{1}{d'}D^{-1}A^\top\tilde\Sigma^+AD^{-1}\| \le \tilde O(n/d^{3/2})$. □

D  Concentration bounds for tensor principal component analysis

For convenience, we restate Lemma 6.5 here.

Lemma D.1 (Restatement of Lemma 6.5). For any $v$, with high probability over $A$, the following occur:
$$\Bigl\|\sum_i \mathrm{Tr}(A_i)\cdot A_i\Bigr\| \le O(n^{3/2}\log^2 n)$$
$$\Bigl\|\sum_i v(i)\cdot A_i\Bigr\| \le O(\sqrt n\log n)$$
$$\Bigl\|\sum_i \mathrm{Tr}(A_i)\,v(i)\cdot vv^\top\Bigr\| \le O(\sqrt n\log n)\,.$$

Proof of Lemma 6.5. We begin with the term $\sum_i \mathrm{Tr}(A_i)\cdot A_i$. It is a sum of iid matrices $\mathrm{Tr}(A_i)\cdot A_i$. A routine computation gives $\mathbb E\,\mathrm{Tr}(A_i)\cdot A_i = \mathrm{Id}$. We will use the truncated matrix Bernstein inequality (Proposition A.7) to bound $\|\sum_i \mathrm{Tr}(A_i)A_i\|$. For notational convenience, let $A$ be distributed like a generic $A_i$.
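The two closed-form expressions quoted from Fact C.4 can be verified numerically. The sketch below is ours, not from the paper: it builds both matrices exactly for a small $d$, cross-checks $\frac{1}{d'}\tilde\Sigma^+$ against a pseudoinverse of the sphere fourth-moment matrix, and confirms that $\|R^2 - \frac{1}{d'}\tilde\Sigma^+\| \le O(1/d)$.

```python
import numpy as np

d = 10
d2 = d * d
dp = (d2 + d) // 2  # d' = dim of the symmetric subspace

# Pi_sym and Phi on R^d tensor R^d.
P_swap = np.zeros((d2, d2))
for i in range(d):
    for j in range(d):
        P_swap[i * d + j, j * d + i] = 1.0
Pi_sym = (np.eye(d2) + P_swap) / 2
Phi = np.eye(d).reshape(d2)

# Fact C.4 closed forms: R^2 and (1/d') Sigma~^+.
R2 = Pi_sym - np.outer(Phi, Phi) / (d + 2)
Sp = ((d + 2) / (d + 1)) * Pi_sym - np.outer(Phi, Phi) / (d + 1)

# Cross-check Sp against Sigma~^+ computed from the sphere fourth-moment formula.
Sigma_t = (2 * Pi_sym + np.outer(Phi, Phi)) / (d * (d + 2))
assert np.allclose(Sp, np.linalg.pinv(Sigma_t) / dp, atol=1e-8)

# ||R^2 - (1/d') Sigma~^+|| = O(1/d): here the gap is 1/(d+1).
gap = np.linalg.norm(R2 - Sp, 2)
assert gap < 2 / d
```

The difference matrix is $\frac{1}{d+1}\Pi_{\mathrm{sym}} - \bigl(\frac{1}{d+1}-\frac{1}{d+2}\bigr)\Phi\Phi^\top$, whose spectral norm is $\frac{1}{d+1}$, consistent with the $O(1/d)$ claim.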
By a union bound, we have both of the following:
$$\mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A\| \ge tn\bigr) \le \mathbb P\bigl(|\mathrm{Tr}(A)| \ge \sqrt{tn}\bigr) + \mathbb P\bigl(\|A\| \ge \sqrt{tn}\bigr)$$
$$\mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A - \mathrm{Id}\| \ge (t+1)n\bigr) \le \mathbb P\bigl(|\mathrm{Tr}(A)| \ge \sqrt{tn}\bigr) + \mathbb P\bigl(\|A\| \ge \sqrt{tn}\bigr)\,.$$
Since $\mathrm{Tr}(A)$ is the sum of iid Gaussians, $\mathbb P(|\mathrm{Tr}(A)| \ge \sqrt{tn}) \le e^{-c_1t}$ for some constant $c_1$. Similarly, since the maximum eigenvalue of a matrix with iid entries has a subgaussian tail, $\mathbb P(\|A\| \ge \sqrt{tn}) \le e^{-c_2t}$ for some $c_2$. All together, for some $c_3$, we get $\mathbb P(\|\mathrm{Tr}(A)\cdot A\| \ge tn) \le e^{-c_3t}$ and $\mathbb P(\|\mathrm{Tr}(A)\cdot A - \mathrm{Id}\| \ge (t+1)n) \le e^{-c_3t}$.

For a positive parameter $\beta$, let $\mathbf 1_\beta$ be the indicator variable for the event $\|\mathrm{Tr}(A)\cdot A\| \le \beta$. Then
$$\begin{aligned}
\mathbb E\,\|\mathrm{Tr}(A)\cdot A\| - \mathbb E\,\|\mathrm{Tr}(A)\cdot A\|\mathbf 1_\beta
&= \int_0^\infty \Bigl[\mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A\| > s\bigr) - \mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A\|\mathbf 1_\beta > s\bigr)\Bigr]\,ds\\
&= \beta\,\mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A\| > \beta\bigr) + \int_\beta^\infty \mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A\| > s\bigr)\,ds\\
&\le \beta e^{-c_3\beta/n} + \int_\beta^\infty \mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A\| > s\bigr)\,ds\\
&= \beta e^{-c_3\beta/n} + \int_{\beta/n}^\infty \mathbb P\bigl(\|\mathrm{Tr}(A)\cdot A\| > tn\bigr)\,n\,dt\\
&\le \beta e^{-c_3\beta/n} + \int_{\beta/n}^\infty ne^{-c_3t}\,dt
= \beta e^{-c_3\beta/n} + \tfrac{n}{c_3}e^{-c_3\beta/n}\,.
\end{aligned}$$
Thus, for some $\beta = O(n\log n)$ we may take the parameters $p,q$ of Proposition A.7 to be $O(n^{-150})$.

The only thing that remains is to bound the parameter $\sigma^2$. Since $(\mathbb E\,\mathrm{Tr}(A)\cdot A)^2 = \mathrm{Id}$, it is enough just to bound $\|\mathbb E\,\mathrm{Tr}(A)^2AA^\top\|$. We use again a union bound:
$$\mathbb P\bigl(\|\mathrm{Tr}(A)^2AA^\top\| \ge tn^2\bigr) \le \mathbb P\bigl(|\mathrm{Tr}(A)| \ge t^{1/4}\sqrt n\bigr) + \mathbb P\bigl(\|A\| \ge t^{1/4}\sqrt n\bigr)\,.$$
By a similar argument as before, using the Gaussian tails of $\mathrm{Tr}(A)$ and $\|A\|$, we get $\mathbb P(\|\mathrm{Tr}(A)^2AA^\top\| \ge tn^2) \le e^{-c_4\sqrt t}$. Then, starting out with the triangle inequality,
$$\begin{aligned}
\sigma^2 = \|n\cdot\mathbb E\,\mathrm{Tr}(A)^2AA^\top\| &\le n\cdot\mathbb E\,\|\mathrm{Tr}(A)^2AA^\top\|
= n\cdot\int_0^\infty \mathbb P\bigl(\|\mathrm{Tr}(A)^2AA^\top\| > s\bigr)\,ds\\
&= n\cdot\int_0^\infty \mathbb P\bigl(\|\mathrm{Tr}(A)^2AA^\top\| > tn^2\bigr)\,n^2\,dt
\le n\cdot\int_0^\infty e^{-c_4\sqrt t}\,n^2\,dt\\
&= n\cdot\Bigl[-\tfrac{2n^2(c_4\sqrt t+1)}{c_4^2}\,e^{-c_4\sqrt t}\Bigr]_{t=0}^{t=\infty}
\le O(n^3)\,.
\end{aligned}$$
This gives that with high probability,
$$\Bigl\|\sum_i \mathrm{Tr}(A_i)\cdot A_i\Bigr\| \le O(n^{3/2}\log^2 n)\,.$$
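The scale of this first bound is easy to probe empirically. The sketch below is ours, not from the paper: for iid $n\times n$ standard Gaussian matrices $A_i$, it computes $\|\sum_i \mathrm{Tr}(A_i)\cdot A_i\|$ directly and checks that it stays well below the $O(n^{3/2}\log^2 n)$ bound (the sum concentrates around its mean $n\cdot\mathrm{Id}$ with fluctuations of order $n^{3/2}$).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
A = rng.normal(size=(n, n, n))  # A[i] plays the role of A_i

traces = np.trace(A, axis1=1, axis2=2)   # Tr(A_i), each distributed as N(0, n)
S = np.einsum('i,ijk->jk', traces, A)    # sum_i Tr(A_i) * A_i
norm = np.linalg.norm(S, 2)

# E sum_i Tr(A_i) A_i = n * Id; the fluctuation around it is ~ n^{3/2} polylog(n),
# so the norm should sit far below the Lemma D.1 bound.
assert norm < n ** 1.5 * np.log(n) ** 2
```

For $n=100$ the observed norm is on the order of $n^{3/2} \approx 10^3$, while the asserted bound $n^{3/2}\log^2 n$ is over $2\times 10^4$, so the check passes with a wide margin.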
The other matrices are easier. First of all, we note that the matrix $\sum_i v(i)\cdot A_i$ has independent standard Gaussian entries, so it is standard that with high probability $\|\sum_i v(i)\cdot A_i\| \le O(\sqrt n\log n)$. Second, we have
$$\sum_i v(i)\,\mathrm{Tr}(A_i)\cdot vv^\top = vv^\top\cdot\sum_i v(i)\,\mathrm{Tr}(A_i)\,.$$
The random variable $\mathrm{Tr}(A_i)$ is a centered Gaussian with variance $n$, and since $v$ is a unit vector, $\sum_i v(i)\,\mathrm{Tr}(A_i)$ is also a centered Gaussian with variance $n$. So with high probability we get
$$\Bigl\|vv^\top\cdot\sum_i v(i)\,\mathrm{Tr}(A_i)\Bigr\| = \Bigl|\sum_i v(i)\,\mathrm{Tr}(A_i)\Bigr| \le O(\sqrt n\log n)$$
by standard estimates. This completes the proof. □
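The two remaining bounds can be probed the same way. The sketch below is ours, not from the paper: it checks that $\sum_i v(i)\cdot A_i$ behaves like a single $n\times n$ standard Gaussian matrix (norm $\approx 2\sqrt n$) and that $\sum_i v(i)\,\mathrm{Tr}(A_i)$ behaves like a centered Gaussian with variance $n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
A = rng.normal(size=(n, n, n))
v = rng.normal(size=n)
v /= np.linalg.norm(v)  # unit vector

# sum_i v(i) A_i has iid standard Gaussian entries since ||v|| = 1,
# so its operator norm is about 2*sqrt(n), well within O(sqrt(n) log n).
M = np.einsum('i,ijk->jk', v, A)
assert np.linalg.norm(M, 2) < 10 * np.sqrt(n)

# sum_i v(i) Tr(A_i) is a centered Gaussian with variance n, so
# ||(sum_i v(i) Tr(A_i)) vv^T|| = |sum_i v(i) Tr(A_i)| is about sqrt(n).
s = float(v @ np.trace(A, axis1=1, axis2=2))
assert abs(s) < 10 * np.sqrt(n)
```

Both assertions are deliberately loose (a $10\sqrt n$ cushion against the $\approx 2\sqrt n$ and $\approx \sqrt n$ typical values), since the point is only to confirm the scaling, not the constants.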