Structure from Local Optima: Learning Subspace Juntas via Higher Order PCA

Structure from Lo cal Optima: F actoring Distributions and Learning Subspace Jun tas (Extended v ersion) San tosh V empala and Yin g Xiao Sc ho ol of Computer Science Georgia Institute of T ec hnology { vemp ala,yi ng.xi ao } @gatech.edu Octob er 31, 2018 Abstract Independent Comp onent Analysis (ICA), a well-known approa ch in statistics, assumes that data is generated b y a pplying a n aﬃne tra nsformation of a fully independent set of ra ndom v ariables, a nd aims to recover the orthog onal basis corresp onding to the indep endent rando m v ariables. W e consider a generaliza tion of ICA, wherein the data is g enerated as an aﬃne transformatio n applied to a pro duct of distributions on tw o orthog onal subspace s, and the goal is to r ecov er the tw o comp onent subspaces. Our main result, extending the work of F rieze, Jerrum and Kannan, is an algorithm for genera lized ICA that uses lo cal o ptima of high moments and recovers the co mpo nent subspaces. When one comp onent is on a k -dimensio nal “relev ant” subspace a nd satisﬁes some mild assumptions while the other is “ noise” mo deled as an ( n − k )- dimensional Gaussia n, the complexity o f the algorithm is T ( k , ǫ ) + p oly ( n ) where T depe nds only on the k -dimensional distribution. W e apply this result to learning a k - subsp ac e junta , i.e., an unkno wn 0-1 function in R n determined b y an unkno wn k - dimensional subspace. This is a common generaliz ation of lea rning a k -junta in R n and of le a rning an intersection of k halfspaces in R n , tw o impor tant problems in learning theory . Our main tools are the use of lo c al optima to reco ver glo bal structure, a gradient-based algorithm for o ptimization ov er tensor s, and an appr oximate polyno mial ident ity test. T ogether, they signiﬁca nt ly extend ICA and the class of k -dimensional lab eling functions that can be learned eﬃciently . 1 In tro d uction Indep endent Comp onent Analysis (ICA) [25 ] is a statistical app roac h that mo dels data in R n as generated by a distribution consisting of n linear com bin ations of n in dep end en t un iv ariate comp onent distr ib utions, y = Ax with x, y ∈ R n , x i are indep endent random v ariables and A is an in vertible n × n matrix; in other w ords, an aﬃne transformation of a pro d uct d istr ibution. The goal is to reco v er the underlying comp onent d istributions of the x i giv en only a set of observ ations y . Sp ecial cases of I C A are of inte r est in m an y application areas with large or h igh-dimensional data sets [24]. An imp ortant feature of IC A, as w e will presen tly see, is that it can pro vide an insightful represent ation ev en when P r incipal Comp onen t Analysis (PCA) do es not. In this pap er, w e consider gener alize d ICA , where instead of n in dep end en t one-dimensional distributions, w e only assume two indep endent distributions on c omp lementary subsp ac es . This natural extension of ICA pro vid es a common generalization of t wo fu ndamenta l problems in high- dimensional learning, wher e one sees lab eled p oints (examples) from an unkn own distribution la- b eled b y an u nknown 0-1 fun ction and the goal is to ﬁ nd a lab eling function that agrees on most of the distribution [40]. Th e ﬁrst, introd uced b y A. Blum [8], is learning a function of k coord inates in R n , known as a k -ju nta. The second is the problem of learning an int ersection of k halfspaces in R n [6, 7] ( k = 1 is the classic problem of learning a halfspace). Although the complexit y of b oth problems is far from settled, there has b een m uch p r ogress in r ecen t yea r s for sp ecial cases, as we d iscuss in Section 1.1. Indeed, generali zed ICA can b e applied t o the problem of lea r ning an unknown f unction of an un kno wn k -dimensional sub space of R n , p ro vided th e distribu tion on p oints can b e factored int o in dep end en t d istributions on the k -dimens ional “relev an t” s ubspace and the ( n − k )-dimensional “noise” subspace. W e giv e an algorithm for generalized ICA that can b e view ed as a tensor ve rs ion of PCA applied to higher moments, sp eciﬁcally lo cal o p tima of momen t fu n ctions to infer the comp onent distributions. The algorithm uses a second-ord er gradien t descen t metho d and an app ro ximate v ersion of the Sc h wartz-Zipp el p olynomial iden tit y test, while its analysis needs to ols from con v ex geometry and probabilit y . Before w e describ e our r esults and tec hn iques in detail, we summarize the kno wn algo r ithmic approac hes to ICA. F or the p roblem of ident ifyin g the sour ce comp onen ts giv en only their linear co mbinations as data, PCA su ggests the approac h of us in g principal comp onents of the data as candidates for the comp onent directions. This w ould indeed reco ve r the comp onents if the co v ariance matrix of the data has distinct nonzero eigen v alues. Ho w eve r , if v a r iances along t wo or more directions are equal, then the prin cipal comp onen ts are n ot un iquely deﬁn ed and PCA d o es n ot work. In more detail, assume th at the data is cent ered, i.e., its mean is zero. Then PCA can b e view ed as ﬁnd ing v ectors on the unit sph ere that are local optima of the second moment of th e pro jection, max x ∈ S n − 1 k Ax k 2 where A is m × n w ith eac h ro w b eing a data p oint. These maxima are eigen v alues o f A T A , the co v ariance matrix of A and hence attain at most n distinct v alues. The v alues and the corresp onding v ectors can b e appro ximated to arbitrary accuracy eﬃcient ly . What to do when eigen v alues are rep eated? T o add ress this, the idea in ICA is to consider a broader class of functions to optimize. A natural c hoice is h igher moments. The use of lo cal optima of f ourth momen ts w as suggested as early as 199 1 [30, 15]. When the comp on ent distrib utions are suﬃcien tly far from b eing Gaussian, the lo cal optima of a family of f u nctions on the u n it sph ere are the comp onent directions [37, 17] (if comp onen t distribu tions are Gaussians, then their linear com binations are also Gaussian and the linear transf ormation A migh t not un iqu ely deﬁ ned). Th is approac h can b e turned into an a p olynomial-time algorithm for un ra veli n g a pro d u ct distribution of a wide class of one-dimensional d istributions. 1 W e now describ e generalized IC A, which signiﬁcan tly w eak ens the ICA assu mption of a f u ll pro du ct distribution. Na m ely , we assume that the distr ib ution F in R n can b e factored in to a pro du ct of tw o indep end en t marginal distributions F V and F W on unknown orthogonal subs p aces V and W = V ⊥ , i.e., F = F V F W . W e c all such an F factorizable . T h u s, a random p oin t in F is generated by ﬁrst pic king its co ordinates in V according to F V and then ind ep endently p ic king co ordinates in W according to F W . The corresp onding problem is the follo wing. Problem 1 (F actoring distributions) . Given (unla b ele d) samples fr o m a factorizable distribution F = F V F W over R n (with V and W u nknown), r e c over a factorization of F . If F in f act factorizes fur ther in to pro du ct of more distributions, or ev en a full p ro duct distri- bution of one-dimensional component distrib u tions as in ICA, an algorithm for the abov e problem can b e applied recursively to ﬁnd th e full factorizati on. W e will giv e an algorithm for this p r oblem under further mild assump tions (roughly sp eaking, at least one of F V , F W is suﬃcien tly diﬀerent from b eing a Gaussian). Our approac h is b ased on viewing PC A as a second momen t optimization problem, then extending this to higher momen ts (alternativ ely , optimizatio n o ve r tensors). Al- though such tensor optimiza tion is intract able in general, f or our setting, it will turn out that lo c al optima pro vide v aluable information, and can b e appro ximated eﬃcien tly . The factoring p r oblem ab o v e has direct applications to learnin g in high dimension. Let π V denote pro jection to a sub s pace V . W e consid er labeling fu nctions ℓ : R n → { 0 , 1 } of the form ℓ ( x ) = ℓ ( π V ( x )). W e are giv en p oin ts according to some distrib ution F o v er R n along with their lab els ℓ ( x ) = ℓ ( π V ( x )) for some unknown subspace V of d imension k (the ‘relev an t’ subs p ace), and wish to learn the unkno wn concept ℓ , i.e ., ﬁnd a fu nction t h at ag rees with ℓ on most of F . W e call this the problem of learning a k - subsp ac e junta . W e further assume that F is factorizable as F = F V F W , with W = V ⊥ (the ‘irrelev an t’ subspace). The justiﬁcation f or this factorizabilit y assumption is that co ord inates in the W sub space are n ot relev an t to the lab eling fun ction and can b e considered to b e noisy attributes. Th e full statemen t of our lea r ning problem is as follo ws: Problem 2 (Learning a k -sub space junt a) . F or ǫ, δ > 0 , given samples dr awn fr om a factorizable distribution F = F V F W , and lab ele d by a ℓ = ℓ ◦ π V , ﬁnd a 0 - 1 function f such that with pr ob ability at le ast 1 − δ , Pr F ( ℓ ( x ) 6 = f ( x )) ≤ ǫ. Our algorithm for generalized IC A leads to an eﬃcien t algorithm for learnin g k -subs pace junta s for a large class of am bien t d istributions F . 1.1 Related w ork Jutten and Herault formalized the ICA problem [25] and mentio n in their pap er that v arian ts of this problem had app eared in a v ariet y of diﬀeren t ﬁelds prior to this (the earliest suc h men tion is in [3]). The notion th at r andom v ariables should b e far from b eing Gaussian p erv ades ICA r esearch. By the cen tral limit theorem, sums of indep endent r andom v aria b les con v erge to a Gaussian, whereas individually the laten t r andom v ariables are not Gaussian. Thus ﬁnding directions that maximize some notion of non-Gaussianit y might r ev eal the laten t v a r iables. Th is intuition is form alized b y in tro d ucing functions whic h ser ve as a pro xy f or non-Gaussianit y , called “con trast functions” in the ICA literature. The deﬁnition of a cont r ast fun ction is that maximizing a contrast fun ction will give an indep endent comp onent. Some examples of co ntrast fun ctions include the kurtosis (4th ord er analogue of v ariance )[30 , 15], v arious cumulan ts, and fu nctions b ased on th e so-calle d ne gentr o py 2 ([16]). Additionally , there are a v ariet y of tensor metho ds and maximum lik elihoo d metho ds used [14, 13, 5]. While there are m any algorithms prop osed for ICA, some of wh ic h app ear to p erform w ell in p ractice (e.g., F astICA [23]), there are almost no explicit time complexit y b ounds. F rieze, Jerrum and Kannan [19] w ere the ﬁrst to give a p olynomial co mp lexit y b oun d for this sp ecial case of IC A, namely a pro duct of uniform distribu tions on i nterv a ls, which can also be view ed as the problem of lea rn ing an unknown parallelopip ed from samples. They used fourth moments, an idea present ed earlier in sev eral pap ers in the ICA literature; the k ey stru ctural lemma is already presen t in [17], which w as ins pired by [37] (Lemma 3 of our pap er is a generalization). S ubsequently , Nguy en and Regev [33] simpliﬁed F r ieze et al’s gradien t descen t algorithm and pr o vided some cr y p tographic applications. A diﬀeren t motiv atio n for our w ork comes from co mp utational learning theory , where learning a k - junta is a fu ndament al problem [8]. In this pr oblem, one is giv en p oin ts from some distribution o v er { 0 , 1 } n , lab eled b y a Bo olean function that dep ends only on k of the n coord inates. The goal is to learn the relev an t k co ord inates and the lab eling fu nction. Naiv e en um er ation of k subsets of the co ordinates lea d s to an algorithm of complexit y rou gh ly n k . Mossel et al [32] ga v e an algorithm of complexit y roughly O ( n 0 . 7 k ) assuming the un if orm distribution o v er { 0 , 1 } n . F or other sp ecial cases of Problem 2 , previous authors ha ve applied stand ard lo w-dimensional representa tion te chniques, lo w-degree p olynomials, random pro jection and Principal C omp onent Analysis (PCA) to identify V under strong distrib utional assump tions [4, 27, 6, 41, 43]. The strongest result in this line ac hiev es a ﬁxed p olynomial d ep enden ce on n by app lyin g PCA to learn con ve x concepts ov er Gaussian input distrib utions [42]. Unfortunately , standard P CA do es not w ork for other distributions or more general concept classes, in part b ecause PC A d o es n ot pr ovide useful in formation when the co v ariance matrices of the p ositive and negativ e samples are equal. In fact, the pr oblem app ears to b e qu ite hard with no assumptions on the input distribu tion, ev en for small v alues of k , e.g., a single h alfsp ace can b e P AC-le arn ed via linear programming, but learning an in tersection of t wo halfspaces (a 2- su bspace jun ta) in p olynomial time is an op en problem. There ha ve b een a num b er of extensions of PCA to tensors [29] analogous to S VD, although no metho d is known to hav e p olynomial complexit y . One approac h is to view PCA as an optimization problem. The top eigenv ecto r is the solution to a matrix optimization problem: max k v k = 1 v T Av = X i 1 ,i 2 ( A ) i 1 ,i 2 v i 1 v i 2 where A is the co v ariance matrix. A higher moment metho d optimizes the multi linear form deﬁ ned b y the tensors of h igher momen ts: max k v k = 1 A ( v , . . . , v ) = X i 1 ,...,i r A i 1 ,...,i r v i 1 . . . v i r . Unlik e th e bilinear case, ﬁnding th e global maximum of a m ultilinear form is h ard. F or α > 16 / 17, it is NP-hard to appr o ximate th e optim u m to b etter than factor α ⌊ r/ 4 ⌋ [10], and the b est kno w n appro ximation f actor is roughly n r / 2 . Sev eral lo cal search metho ds ha ve b een prop osed f or this problem as well[ 28 ]. 1.2 Results T o state our results formally , w e n eed to deﬁne the distance of a distrib ution from a Gauss ian via momen ts. F or a random ve ctor x ∈ R n with distribution F , th e m th momen t tensor M m is a tensor 3 of order m with n m en tries giv en by: M m i 1 ,...,i m = E ( x i 1 . . . x i m ) . Let Γ n b e the standard Gaussian distribu tion o v er R n and γ m denote the m th momen t of a standard Gaussian random v ariable: γ m = ( m − 1)!! when m is ev en and 0 when m is o dd. The m th -moment distanc e of t w o distribu tions F, G ov er R n is deﬁned as d m ( F , G ) = max k u k =1   E F  ( x T u ) m  − E G  ( x T u ) m    = k M m F − M m G k 2 . W e say th at a distribution F o ver R k is ( m, η )- moment-distinguishable along u nit v ector u ∈ R k , if either there exists j ≤ m :   E F  ( x T u ) j  − γ j   ≥ η or there exist un it v ectors { v 1 , . . . , v t } ⊂ u ⊥ where t ≤ m suc h that   E F  ( x T u ) m − t Π t i =1 ( x T v i )  − E F  ( x T u ) m − t  E  Π t i =1 ( x T v i )    ≥ η . In wo r d s, F diﬀers f rom a Gaussian e ither alo n g some dir ection u , or by exhibiting a correlation b et ween its marginal along u and v ectors orth ogonal to u (for a Gaussian su c h subsets ha ve zero correlation). Th e rationale for this deﬁn ition is that if t w o contin uous distributions are iden tical (or close) in man y momen ts, then one would exp ect them to b e close in L 1 distance. F or example, the follo wing holds for one-dimensional logconca v e distributions via an explicit b ound on the num b er of momen ts required. Lemma 1 ( L 1 distance from Gaussian) . Fix m and ǫ > 0 . L et f : R → R b e an isotr o pic lo gc onc a ve density, whose ﬁrst m moments satisfy | E f ( x m ) − γ m | < ǫ , then: k f − g k 1 ≤  c m 1 / 8 + c ′ me m ǫ 2  1 / 2 log m ≤ c  log m m 1 / 16 + ǫme m  W e are now ready to s tate our ﬁrst main result: we can eﬃcien tly factorize distributions assum- ing the distribu tion on the r elev a nt subs p ace is momen t-distinguish able and the d istribution on the irrelev an t noisy attributes is some Gaussian. In what follo ws, it migh t b e illus trativ e t o regard k as a constant indep end en t of n . Let C F ( n, m, ǫ ) b e the num b er of samples needed to estimate eac h en try of the m th momen t tensor of F to within additiv e error ǫ and M b e an upp er b ound on the m th momen t along an y direction. Theorem 1 (F actoring, Gaussian noise) . L et F = F V F W b e a distribution over R n wher e V is a subsp ac e of dimension k , and F W = Γ n − k . Supp ose that F V is ( m, η ) -moment-distinguishable for e ach unit ve ctor u ∈ V . Then for any ǫ, δ ≥ 0 , in time C F ( n, m, ǫ )p oly( n, η , 1 /ǫ, log (1 /δ ) , M ) , Algor ithm F actorUnderGaussian ﬁnds a subsp ac e U of dimension at most k such that for j ≤ m , d j ( F , F U F U ⊥ ) ≤ j ( M + γ j ) ǫ with pr ob ability at le ast 1 − δ . In addition, for a ny ve ctor in u ∈ U , k π V ( u ) k ≥ 1 − ǫ . Next w e turn to learning. F or a distribution F and a k -dimensional concept class H , w e say that the triple ( k , F, H ) is ( m, η ) -moment-le arnable if: 1. F = F V F W is a factorizable distrib ution with dim ( V ) = k . 4 2. H is a set of k -subspace jun tas whose relev a nt subspaces are cont ained in V . 3. F or ℓ ∈ H with minimal (with r esp ect to dimension) relev an t subsp ace P ⊆ V , for eac h unit v ector u ∈ P , either F V or F + V (the distribution ov er the p ositiv e samples) is ( m, η )-momen t distinguishable along u . In words, the third condition sa ys that if F V resem bles a Gaussian in its ﬁrst m momen ts along ev ery direction, then F + V do es not. W e will s ee examples o f conce p t classes a n d distribu tions for whic h m is b ounded und er this deﬁnition. Indeed, we conjecture that a concept class H with b ound ed V C d imension d is ( m, η ) momen t-learnable where m dep ends only on d and η . T o state our learning gu arantee, we n eed one more deﬁn ition: A trip le ( k , F , H ) is called r obust if for any sub space U of dimension at most k with orthonormal basis { u i } where   u T i π V ( u i )   ≥ 1 − ǫ , then ℓ ( π U ( x )) lab els correctly 1 − g ( ǫ ) fraction of R n under F where g ( ǫ ) < ǫ c for constant c > 0 and su ﬃcien tly small ǫ . T he deﬁ nition requires the d istribution F and la b eling function ℓ to b e robust un der small p erturb ations of the relev an t subspace. O nce we ident ify the r elev a nt subspace appro ximately , w e can p ro ject samples to it and use an algorithm that can learn ℓ in spite of a g ( ǫ ) fraction of noisy lab els. Theorem 2 (Learning, Gaussian n oise) . L et ǫ, δ > 0 , let ℓ ∈ H wher e ( k , F, H ) is ( m, η ) -moment- le arna ble and r obust, and let F W = Γ n − k b e Gaussian. Supp ose that we ar e gi v en lab ele d examples fr om F , th en Al gorithm L e arnUnderGaussian i dentiﬁes a subsp a c e U and a hyp othesis h such that h c orr e ctly classiﬁes 1 − ǫ of F ac c or ding to ℓ with pr ob ability at le ast 1 − δ . The time and sample c ompl exity of the algorithm ar e b ounde d by T ( k , ǫ ) + C F ( n, m, ǫ )p oly( n, η , k , 1 /ǫ, log (1 /δ ) , M ) wher e T is the c omplexity of le arning the k -dimensional c onc e pt class H . W e note here that for a concept class of V C-dimension d , a stand ard red uction implies that the complexit y of learning with ǫ arbitrary noise is at most (2 /ǫ ) O ( d log(1 /ǫ )) times the complexit y of learnin g with no noise (Pr op osition 9). Ou r algorithms run in p olynomial-time in n pro vided ( k , F , H ) satisfy the momen t-learnable condition. Some sp ecial cases of this result were p reviously kno wn, e.g., wh en F is a Gaussian and H is a con vex c oncept class [27, 42]. Th e application of PCA to learning conv ex b o dies in [42] can b e view ed as the assertion that conv ex concepts in R k are momen t-learnable: u nder a Gaussian distribution, the p ositiv e distrib ution F + has v ariance less than 1 along an y direction. The follo wing t wo examples furth er illustrate Th eorem 2 . • When th e full distribution in the relev a nt subsp ace is uniform in an ellipsoid, then rob u st concept classes can b e learned in time T ( k , ǫ ) + C k ,ǫ · n 2 . Here T dep ends on the k and concept class, and C is a constan t ﬁxed by k and ǫ and indep endent of the concept class. Thus we can learn general concept classes b ey ond con ve x b o dies and lo w-degree p olynomials for uniform distributions o v er a b all in the relev an t subsp ace. • When t h e distribu tion on the p ositiv e examples F + has b ounded supp ort, i.e., the positiv e lab els lie in a ball of r adius r ( k ), suc h robust concepts can b e learned in time T ( k , ǫ ) + C k ,ǫ · n O ( r ( k ) 2 ) for an arbitrary distribu tion in the relev an t subspace. Previously , f or logconca v e F , learning an intersectio n o f k half-spaces was kno wn to h a ve complexit y growing as n O ( k ) [43, 26]. 1.3 T ec hniques Our strategy is to iden tify th e r elev an t subspace V is to examine higher momen ts of the distrib u tion. 5 As men tioned ea rlier, our app roac h is in spired by viewing PCA as ﬁnd ing th e global maxima and minima of the bilinear f orm deﬁ n ed by th e co v ariance mat r ix. Instead of trying to compu te glob al optima of the multilinear form, we u se lo c al optima. Th ese lo cal optima turn out to b e highly structured. The use of lo cal optima can b e view ed as an eﬀectiv e realization of higher-order PCA that leads to eﬃcient alg orithms . Pr evious alg orithmic problems h a v e all required the use of global optima — for example, the planted clique algorithms of [20, 11]. W e prov e that lo cal optima of the m th momen t f m ( u ) = E  ( x T u ) m  m ust lie e ntirely in V or its complemen t W (Lemma 4 ) unless its ﬁ r st m moments are identic al to those of a Gaussian. T o make these ideas algorithmic, we use a lo cal searc h metho d th at increases the function v alue b y p erforming ﬁr st-order m o v es along the gradient and then second-order mo v es in the dir ection of the top eigenv ecto r of th e Hessian matrix. These second-order mo v es allo w us to av oid sadd le p oints and other critical p oint s whic h arise in higher dimensions. Saddle p oin ts ha v e a gradien t of zero and lo ok lik e maxima in some directions and minima in others. While s earching f or a lo cal maxim um, one could end up in a sadd le p oin t. The top eigen vec tor of the Hessian sh o ws directions of greatest quadratic increase, an d hence will mo v e u s from the saddle p oint to a true lo cal maximum. Another comp onen t in our a lgorithms is an appro ximate v ersion of the well- kn o wn Sc h wartz- Zipp el p olynomial id en tit y test. Ob serving that f m ( u ) is a p olynomial of d egree m in the v ariables u 1 , . . . , u m , in principle we can test ( whether f m is a constan t function by ev aluati n g f m at random p oints. W e u se a robust v ersion of this test (Lemma 12) d eriv ed via a r esult of C arb ery and W right [12]. 2 Structure of lo cal optima W e derive a rep r esen tation of f m ( u ) = E  ( x t u ) m  in L emm a 3. Using th is r ep resen tation, we show in Lemma 4 that eac h local optim um lies in V or W exclusiv ely . Finding a sequence of orthogonal lo cal optima will giv e u s basis vect ors for the r elev an t subspace. F or con venience we often use u V = π V ( u ) for the pro jection of u onto V , u W for the p ro jection on to the orthogonal subspace W , and u 0 for the u nit v ector in the directio n of u . W e ma y assume that E ( x ) = 0: if otherwise, then we can apply a translation x − E ( x ). Lemma 2 (T r anslation of pro d uct d istributions) . L et x ∈ R n b e a r andom ve ctor dr awn fr o m F = F V F W , a pr o duct distribution.Then x − E ( x ) has a pr o d u c t distribution over V and W . Pr o of of L emma 2. T ak e our translation y = T a ( x ) = x + a , for Bo r el sets B 1 and B 2 : Pr ( y V ∈ B 1 ∧ y W ∈ B 2 ) = Pr ( x V + a V ∈ B 1 ∧ x W + a W ∈ B 2 ) = Pr ( x V ∈ B 1 − a V ∧ x W ∈ B 2 − a W ) = Pr ( x V ∈ B 1 − a V ) Pr ( x W ∈ B 2 − a W ) = Pr ( y V ∈ B 1 ) Pr ( y W ∈ B 2 ) e ca n com b ine this with a linear transformatio n to obtain a n isotropic distribu tion, giv en b y y = Σ − 1 / 2 ( x − µ ) w here µ is the exp ectation v ector. T h is s impliﬁes subsequent calculations b ecause the co v ariance matrix for y is I n . T h e follo wing lemma, inspired b y [19, 17, 37], pro vides the main insigh t for the structural theorem. 6 Lemma 3 ( Repr esen tation of f m ) . L et F = F V F W . Supp ose that x has the same j th moments as a Gaussian for al l inte gers j < m , then for u ∈ S n − 1 : f m ( u ) = k u V k m  E  ( x T V u 0 V ) m  − γ m  + k u W k m  E  ( x T W u 0 W ) m  − γ m  + γ m (1) Pr o of of L emma 3. Consider the case when m is o d d: f m ( u ) = E  ( x T u ) m  = E   ( x V + x W ) T ( u V + u W )  m  = E  ( x T V u V ) m  + E  ( x T W u W ) m  + m − 1 X i =1  m i  E  ( x T V u V ) i ( x T W u W ) m − i  = E  ( x T V u V ) m  + E  ( x T W u W ) m  + m − 1 X i =1  m i  E  ( x T V u V ) i  E  ( x T W u W ) m − i  The last line follo ws by app lying the indep endence of r andom v ariables whic h dep en d only on the V and W su bspaces. Eac h term in the last sum con tains an o d d momen t of a Gaussian, hence: f m ( u ) = k u V k m E  ( x T V u 0 V ) m  + k u W k m E  ( x T W u 0 W ) m  . (2) When m is ev en, we n eed the follo w in g form ula: m X i =0  m i  k u V k i k u W k m − i γ i γ m − i = γ m This follo ws from E (( aX + bY ) m ) = γ m where a 2 + b 2 = 1 an d X and Y are ind ep endent standard normal v ariables: f m ( u ) = m X i =0  m i  k u V k i k u W k m − i E  ( x T V u 0 V ) i  E  ( x T W u 0 W ) m − i  = k u V k m E  ( x T u 0 V ) m  + k u W k m E  ( x T u 0 V ) m  + m − 1 X i =1  m i  k u V k i k u W k m − i γ i γ m − i = k u V k m  E  ( x T u 0 V ) m  − γ m  + k u W k m  E  ( x T u 0 V ) m  − γ m  + m X i =0  m i  k u V k i k u W k m − i γ i γ m − i = k u V k m  E  ( x T u 0 V ) m  − γ m  + k u W k m  E  ( x T u 0 V ) m  − γ m  + γ m Using this r epresent ation, we can c haracterize all lo cal optima of f m . Lemma 4 (Supp ort) . L et distribution F = F V F W have the same ﬁrst m − 1 moments as a Gaussian but a diﬀer ent m th moment. Then for a lo c al maximum (lo c al minimum) u ∗ of f m r estricte d to the unit spher e, wher e f m ( u ∗ ) > γ m ( f m ( u ∗ ) < γ m ), ei ther k u ∗ V k = 1 or k u ∗ W k = 1 . 7 Pr o of of L emma 4. Consider the c u rv e C = { s ( u ∗ V ) 0 + t ( u ∗ W ) 0 : s 2 + t 2 = 1 , s ≥ 0 , t ≥ 0 } . T h e p oint u ∗ lies on C : th u s if u ∗ is a lo cal maximum in full space, it h ad b etter b e a lo cal maxim u m on C . On the other han d , we will sho w that there are no lo cal maxima interior to C , when ce w e m ust ha ve k u ∗ V k = 1 or k u ∗ W k = 1. Let u s denote a v =  E  ( x T u 0 V ) m  − γ m  and a w =  E  ( x T u 0 w ) m  − γ m  . By th e assumption that f m ( u ∗ ) > γ m , we kno w that least one of a v or a w is p ositive . S upp ose that s 6 = 0 and s 6 = 1: w e form the asso ciated Lagrangian with p ositiv e real m u ltiplier λ : L = a v s m + a w t m + γ m − λ ( s 2 + t 2 − 1) A t ev ery critical p oin t in the interior of C , w e m us t hav e D L = 0:  ma v s m − 1 − 2 λs ma w t m − 1 − 2 λt  = 0 If w e consider on ly the inte r ior critical p oint s where s, t > 0, then b oth a v > 0 and a w > 0 (otherwise w e w ould ha ve λ > 0 and λ ≤ 0). Th ere is only one solution: s = a 1 / ( m − 2) w q a 2 / ( m − 2) v + a 2 / ( m − 2) w t = a 1 / ( m − 2) v q a 2 / ( m − 2) v + a 2 / ( m − 2) w λ = ( m/ 2)( a v s m + a w t m ) If w e no w consid er th e Hessian on the tangen t plane orthogonal to the gradien t of the constraint (equiv ale nt to considering the b ordered Hessian) , we see that it is p ositive deﬁnite for m > 3 (when m = 3, diﬀeren tiating a v s 3 + a w (1 − s 2 ) 1 . 5 t wice at the critical p oint giv es a p ositive v alue): D 2 L =  m ( m − 1) a v s m − 2 − 2 λ 0 0 m ( m − 1) a w t m − 2 − 2 λ  = m ( m − 3) a v a w h a 2 / ( m − 2) v + a 2 / ( m − 2) w i ( m − 2) / 2 I > 0 In particular, there are no local maxima in terior to C , that is, k u ∗ V k = 1 or k u ∗ V k = 0. 3 Finding a basis Our t wo basic algorithms exploit th e pr op erty that lo cal optimum to f m ( u ) = E  ( x T u ) m  on the unit sphere must lie in either V or W (Lemma 4). In this section, w e assume that the algorithms ha ve acc ess to exac t momen t tensors and can compute exac t local optima. W e p ro vide eﬃcien t algorithms (with error analysis) in Section 4. The basic idea of the algorithm is to start with a r andom direction and ev aluate the j ’th moment in that direction. If it is d iﬀeren t from a Gaussian we go to a lo cal max or lo cal m in (whic h ev er k eeps it diﬀeren t from a Gaussian) and thus ﬁ n d a v ector of in terest; if man y random unit v ectors ha ve Gaussian moments, then all directions ha v e Gaussian momen ts due t o the Sc hw artz-Zipp el Lemma and w e go to the next higher moment. At the end of the algorithm w e h a v e a su bset of an orthogonal b asis consistent with V and W , and t h e p rop erty that a ll orthogonal directi ons hav e Gaussian momen ts. Lemma 5 ( S c hw artz-Zipp el[36]) . L et P ∈ F [ x 1 , . . . , x n ] b e a nonzer o p olynomial of de gr e e dn ≥ 0 over ﬁeld F . L et S b e a ﬁnite subset of F and let r 1 , . . . , r n b e sele cte d r ando mly fr om S . Then: Pr ( P ( r 1 , . . . , r n ) = 0) ≤ d | S | . 8 Algorithm 1 FindBasis Input: Momen t b ound m , Distribution F 1: Orthonormal v ectors B ← φ , momen t tensor j ← 2. 2: while | B | < n and j ≤ m do 3: Comp ute th e j th momen t tensor M B j orthogonal to B , so that for any v ∈ B ⊥ , f j ( v ) = E  ( x T v ) j  = M B j ( v , . . . , v ). 4: if f j ( v / k v k ) ≡ γ j then 5: j ← j + 1 6: else 7: if f j ( v ) > γ j for some v then 8: Compute a local maxim u m u ∗ to f j starting from v . 9: else 10: Compute a lo cal min im um u ∗ to f j starting from v . 11: B ← B ∪ { u ∗ } . 12: return B F or Line 3, let A : R n → R n denote the linear map that pro jects orthogonal to B . T hen M ( Au, . . . , Au ) = X i 1 ,...,i m M i 1 ,...,i m ( Au ) i 1 · · · ( Au ) i m = X j 1 ,...,j m   X i 1 ,...,i m M i 1 ,...,i m A i 1 ,j 1 · · · A i m ,j m   u j 1 · · · u j m The identit y c heck in Line 4 is p er f ormed by selecting a rand om vec tor x with i.i.d. uniform co ordinates fr om { 1 , . . . , 2 m } and ev aluating th e p olynomial f j ( x/ k x k ) − γ j . Rep eating O (log ( n/δ )) times giv es a 1 − δ probabilit y of success. Algorithm FindBasis does not suﬃce on its o wn. Although eve r y dir ection orthogo n al to B lo oks Gaussian up to the m ’th moment, it is p ossible that some directions are correlated w ith v ectors in B . The next pr o cedure iden tiﬁes suc h directions. Theorem 3 (Find Basis) . L et F = F V F W b e a factorizable distribution over R n . Then, with pr o b ability at le ast 1 − δ , e ach ve ctor in the output of FindBasis lies in either V or W . Pr o of of The or em 3. F rom the ab ov e comment, at eac h s tep, with probabilit y at least 1 − δ /n (hence total failure probabilit y δ ), we are able to ﬁ nd a p oin t u where f r ( u ) 6 = γ m . In particular, if f r ( u ) > γ m , then w e ﬁnd a lo cal m aximum u ∗ . By Lemma 4 , u ∗ is con tained entirel y w ithin V or W . The analysis is identica l when our initial p oin t u satisﬁes f r ( u ) < γ m . Observe that F V \ span( B ) F W \ span( B ) is a factorizable distribution o ver π B ⊥ ( x ), and hence a lo cal optim um in B ⊥ also will lie in either V or W . The n ext theorem states that ExtendBasis ﬁn ds all vec tors wh ic h are correlated w ith S ⊆ V , and that all remaining vect ors at the end of the algorithm are uncorrelated u p to th e m th momen t. Theorem 4 (Basis Extension) . Th e output S ′ of ExtendBasis on input S ⊆ V , T ⊆ W satisﬁes: 1. S ⊆ S ′ ⊆ V . 2. F or { v t } ⊂ S ′ and { u i } ⊂ ( S ′ ) ⊥ : E j − l Y i =1 ( x T u i ) l Y t =1 ( x T v t ) ! = E j − l Y i =1 ( x T u i ) ! E l Y t =1 ( x T v t ) ! 9 Algorithm 2 ExtendBasis Input: Momen t b ound m , distribu tion F , candidate v ectors S and non-Gaussian directions T . 1: S ′ ← S , j ← 2. 2: while | S ′ | < k and j ≤ m do 3: for eac h choice (with rep etition) { v 1 , . . . , v l } ⊆ S ′ where 1 ≤ l < j . do 4: Compute the ( j − l ) tensor M S ′ ,T l,j so that for any u ∈ ( S ′ ∩ T ) ⊥ , g ( u ) = E  ( x T u ) j − l Y ( x T v t )  − E  ( x T u ) j − l  E  Y ( x T v t )  = M S ′ ,T l,j ( u, .., u, v 1 , . . . , v l ) . 5: if g ( u ) ≡ 0 then 6: Con tinue w ith next c hoice of { v i } . 7: else 8: if g ( u ) > 0 for an y u then 9: Compute a local maxim u m u ∗ to g starting with u/ k u k . 10: else 11: Compute a lo cal min im um u ∗ to g starting with u/ k u k . 12: S ′ ← S ′ ∪ { u ∗ } and r estart the wh ile loop with j = 3. 13: j ← j + 1. 14: return S ′ . Pr o of of The or em 4. The S c hw artz-Zipp el lemma retur ns a correct decision at eve r y iteration (there are at most n k of these, so if w e pic k our domain to b e of size 2 n k and run O (log n k /δ ) iterations eac h time, then w e ha ve a correct decision for all iterations with probabilit y at least 1 − δ . Let u ∗ b e a lo cal maxim um found by ExtendBasis using the j th momen t. Consider the { v 1 , . . . , v l } where u ∗ w as found . g ( u ) = E ( x T u ) ( j − l ) l Y t =1 ( x T v t ) ! − E  ( x T u ) ( j − l )  E l Y t =1 ( x T v t ) ! = j − l X i =0  j − l i  E  ( x T u W ) i  " E ( x T u V ) ( j − l − i ) l Y t =1 ( x T v t ) ! − E l Y t =1 ( x T v t ) ! E  ( x T u V ) ( j − l − i )  # Since u ∗ w as found at momen t j , then for all 0 < i < j − l : E ( x T u V ) ( j − l − i ) l Y t =1 ( x T v t ) ! = E l Y t =1 ( x T v t ) ! E  ( x T u V ) ( j − l − i )  Only the ﬁ r st and last terms surviv e: g ( u ) = E ( x T u V ) ( j − l ) l Y t =1 ( x T v t ) ! − E l Y t =1 ( x T v t ) ! E  ( x T u V ) ( j − l )  = k u V k j − l " E ( x T u 0 V ) ( j − l ) l Y t =1 ( x T v t ) ! − E l Y t =1 ( x T v t ) ! E  ( x T u 0 V ) ( j − l )  # Ha ving a p ositiv e lo cal maxim um imp lies that k u V k = 1. 10 F or the second part of this lemma: w e already kn o w th at all the r emaining v ectors must ha v e Gaussian moments. Fix j ≤ m and a choi ce of { v 1 , . . . v l } from S ′ and consider th e symmetric tensor ˆ M represented b y f ( u ) − E  ( x T u ) j − l  E  Q l t =1 ( x T v t )  . W e require the follo wing claim for symmetric tensors where for an y p ermuta tion σ : [ m ] → [ m ]: E (Π m k =1 x i k ) = E  Π m k =1 x i σ ( k )  . Claim 6. If A is a symmetric or d er r tensor, then: max k v k = 1 A ( v , . . . , v ) = max k v 1 k =1 ,..., k v r k =1 A ( v 1 , . . . , v r ) Using Claim 6 : max k u k =1 ˆ M ( u, . . . , u ) = max k u 1 k =1 ,..., k u j − l k =1 ˆ M ( u 1 , . . . , u j − l ) In particular, there exists { u i } suc h that E j − l Y i =1 ( x T u i ) l Y t =1 ( x T v t ) ! > E j − l Y i =1 ( x T u i ) ! E l Y t =1 ( x T v t ) ! if and only if there exists u such that E ( x T u ) j − l l Y t =1 ( x T v t ) ! > E  ( x T u ) j − l  E l Y t =1 ( x T v t ) ! But at th e en d of the algorithm, we know that there are no suc h u , hence there can b e no s uc h u i either. Thus, we can factor an y u / ∈ S ′ through the exp ectations w hic h con tain only v i from S ′ . 3.1 Generalized ICA W e can no w giv e an algorithm for generalized ICA, assuming access to exact m omen t tensors and lo cal optima. If F is factorizable, Algorithm F actor will p ro vide a f actoring into subspaces suc h that the marginal distributions lo ok indep endent up to m m oments. The outp ut of FindBasis is a set of v ectors th at eac h lie in V or in W . W e try all p ossib ilities for the su bset from V , th en extend this using E xtendBasis , consider th e resulting decomp osition and tak e th e option that giv es a pro du ct factorization. Th e factorization found will b e U, U ⊥ for some U ⊆ V . Theorem 5 (F actoring, general n oise) . F or any ǫ, δ > 0 , given the moment tensors of distribution F over R n and the ability to c o mpute exact lo c al optima, if ther e e xi sts a subsp ac e V with dim ( V ) = k such that for j = 1 , . . . , m d j ( F , F V F W ) = 0 , Algorith m F actor ﬁnds a subsp ac e U suc h that d ( F, F U F U ⊥ ) = 0 with pr ob ability at le ast 1 − δ . The time and sample c omp lexity of the algorithm ar e b ounde d by n O ( k + m ) . Pr o of of The or em 5. In the en um eration of all subsets of size at most k subsets of B at line 2, we encoun ter T = B ∩ V . By Theorem 4, the output S ′ of E xtendBasis con tains only v ectors in V 11 Algorithm 3 F actor Input: Highest momen t m , distribution F . 1: B ← F indB asis ( m, F ) . 2: for every su bset T ⊆ B of at most k vecto rs do 3: S ′ ← E xtendB asis ( m, F , T , B − T ) . 4: T ′ ← E xtendB asis ( m, F , B − T , S ′ ) . 5: if | S ′ | > k or | T ′ | > n − k then 6: Con tinue w ith the n ext c hoice of T . 7: Augment S ′ with k − | S ′ | orth onormal v ectors from R n − span ( T ′ ), forming basis U . 8: Comp ute m momen ts of F , F U and F U ⊥ : 9: for l ≤ m do 10: Compute: ∆ l U = X ( i 1 ,...,i l )  E F ( x i 1 · · · x i l ) − E F U  x p 1 · · · x p j  E F U ⊥  x p j +1 · · · x p l   2 where { x p 1 , . . . , x p k } co r r esp ond to coordinates in U , and { x p k +1 , . . . , x p l } co r r esp ond to co ordinates in the U ⊥ subspace. 11: return U with lo west ∆ 3 U , breaking ties by consider in g ∆ 4 U , ∆ 5 U , . . . and T ′ con tains only vecto rs from W . By Part 2 of T heorem 4, the follo wing holds for an y c h oice of v ector { u i } ⊂ S n − 1 − span ( S ′ , T ′ ), w e ha ve E  ( x t u i ) j  = γ m for j ≤ m and that: E j − l Y i =1 ( x T u i ) l Y t =1 ( x T v t ) ! = E j − l Y i =1 ( x T u i ) ! E l Y t =1 ( x T v t ) ! for v t ∈ S ′ ∪ T ′ . In p articular, ev ery suc h u is indep endent from S ′ and T ′ up to the m th momen t. In the augmented b asis, the exp ectations separate in to the p ro du cts o ver the t w o subsp aces: E ( x i 1 . . . x i l ) = E  x p 1 . . . x p j  E  x p j +1 . . . x p l  where { x p 1 , . . . , x p j } corresp ond to coord inates in the U subspace, and { x p j +1 , . . . , x p l } corresp ond to co ord inates in the U ⊥ subspace. In particular, the en tries of the moment tensor of F are equal to the entries of the momen t tensor of F U F U ⊥ , an d hence will return ∆ j = 0. 4 Gaussian noise mo del W e no w giv e a complete algorithm assum in g F W is a Gaussian, assumin g we only hav e access to F through samples (n ot exact moment tensors). The main d iﬃcult y is handling the error accum u lation o v er multi p le ite r ations, as i n eac h round we can only hop e to appro ximately compute moments and ﬁ nd appr o ximate local optima. The idea is that FindBasis a n d ExtendBasis ﬁ nd vec tors where E  ( x T u ) m  6 = γ m . If F W is Gaussian, our algorithms only ﬁn d directions in V . Th u s, the error accum ulates o v er only k steps, and the total err or dep ends on k rather than n . 12 4.1 Lo cal search T o compute app ro ximate lo cal optima, w e p er f orm gradien t ascen t, mo ving in the d irection of the gradien t. If mo ving along the gradien t do es not in crease the function v alue by a certain v alue, w e switc h to second-order mo ve s based on the Hessian. W e will use th e notation that D g u is the gradien t of g at u and D 2 g u for the Hessian matrix. Th e top eig env a lue of a matrix on a su bspace orthogonal to a v ector can b e computed via a co ord inate transform ation. W e d enote M = k M m k 2 . Lo calOpt terminates in p olynomial time w hen the p arameters ǫ 1 , r 1 , ǫ 2 and r 2 , the thresholds and Algorithm 4 Lo calOpt Input: F unction g , error parameter ǫ 1 , 1: u ← un iformly at r andom ov er unit sphere. 2: while |h u, D g u i| ≤ (1 − ǫ 1 ) k D g u k or λ max ( D 2 g u ) ≥ ǫ 2 on u ⊥ do 3: if |h u, D g u i| ≤ (1 − ǫ 1 ) k D g u k then 4: Direction v ← π u ⊥ ( D g u ). 5: u ← u + r 1 v / k v k . 6: Renormalize u ← u/ k u k . 7: else if λ max ( D 2 g u ) ≥ ǫ 2 on u ⊥ then 8: Direction v ← top eige nv ector of D 2 g u on u ⊥ . 9: u ← u − r 2 v / k v k . 10: Renormalize u ← u/ k u k . 11: return u . step sizes for the ﬁrs t-order mo v es and seco n d-order mo v es are c hosen appropriately . Note that ǫ 2 v aries w ith the fun ction v a lu e, but the remaining parameters are ﬁxed. Lemma 7 (Lo cal searc h termination) . L et g ( u ) satisfy g ( tu ) = t m g ( u ) for some inte ger m . Supp ose that for our starting p oint u that g ( u ) ≥ η > 0 . Cho os e the p ar ameters as fol lows: ǫ 1 ≤  81 m ( m − 1) 2 η 2 1048 M  2 r 1 ≤ √ ǫ 1 4 m 2 M ǫ 2 = 3 m ( m − 1) g ( u ) 4 r 2 ≤ 9 η 256( m − 2) M wher e M is an upp er b ound for g on the unit spher e. Then L o c alOpt wil l terminate in at mo st p oly( M , m, 1 / ǫ 1 , 1 /r 1 , 1 /r 2 , 1 /η ) iter ations. Pr o of of L emma 7. Consider an it eration of the algorithm where the ﬁr st deriv ati ve co n dition is unsatisﬁed, an d we mak e a step of size r 1 in the direction of v / k v k (call the step h ). Th e new function v alue at this p oint u + h is giv en by the T a ylor series expansion with error (where ζ lies b et ween u and u + h ): g ( u + h ) = g ( u ) + D g u · h + 1 2 h T ( D 2 g ζ ) h 13 The increase in fun ction v alue is lo w er b ounded as follo ws: D g u · h + 1 2 h T D 2 g ζ h ≥ r 1  D g u , v k v k  − 1 2 r 2 1 ( v / k v k ) T D 2 g ζ ( v / k v k ) ≥ r 1 √ ǫ 1 − 1 2 r 2 1 m 2 M ≥ r 1 √ ǫ 1 − 1 8 r 1 √ ǫ 1 ≥ 7 8 r 1 √ ǫ 1 Th u s, w e ha v e lo wer b ound ed the increase in t h e function v alue. When we rescale u + h b ac k to norm 1, w e can apply the m -homogeneit y of f to deduce that: g  u + h k u + h k  = 1 k u + h k m/ 2 g ( u + h ) W e can compu te k u + h k = 1 + r 2 1 b ecause r 1 is p erp end icular to u . Hence: g  u + h k u + h k  = 1 (1 + r 2 1 ) m/ 2 g ( u + h ) ≥  1 − m + 2 2 r 2 1  g ( u + h ) ≥ g ( u )  1 + 7 8 g ( u ) r 1 √ ǫ 1   1 − m + 2 2 r 2 1  where w e used the estimate: (1 + x ) k ≥ 1 + ( k + 1 / 2) x for x ≤ 2 /k 2 . T o ﬁnish this calculation, w e simply substitute our v alue for r 1 in terms of ǫ 1 : g  u + h k u + h k  ≥ g ( u )  1 + 7 8 g ( u ) r 1 √ ǫ 1   1 − 1 8 M r 1 √ ǫ 1  ≥ g ( u )  1 + 5 8 M r 1 √ ǫ 1  Hence, there are at most at most a p olynomial n umber of iteratio n s of this form. Consid er no w an iterat ion wh er e the second deriv ativ e condition is uns atisﬁed (and the ﬁr st der iv ativ e condition m ust b e satisﬁed). W e tak e the same T a ylor series expansion with error term (to on e further term), where h = r 2 v / k v k : g ( u + h ) = g ( u ) + D g u · h + 1 2 h T D 2 g u h + 1 6 D 3 g ζ ( h, h, h ) W e w ill sho w that the con tributions from the ﬁrst and third deriv ativ e terms are small, and th at the second deriv ativ e term dominates. In the ﬁ rst deriv ati ve term, note that h is orthogonal to u , 14 and hence the comp onen t of D g u parallel to h has norm at most p 2 ǫ 1 − ǫ 2 1 k D g u k . W e estimate the other terms as b efore: D g u · h + 1 2 h T D 2 g u h + 1 6 D 3 g ζ ( h, h, h ) ≥ − p 2 ǫ 1 − ǫ 2 mM + 1 2 r 2 2 ǫ 2 − m ( m − 1)( m − 2) 6 r 3 2 M ≥ − 1 128 r 2 2 ǫ 2 + 1 2 r 2 2 ǫ 2 − 1 128 r 2 2 ǫ 2 ≥ 31 64 r 2 2 ǫ 2 Once aga in, w e ha ve to rescale bac k to norm 1. In this case: g  u + h k u + h k  ≥ g ( u )  1 + 31 64 g ( u ) r 2 2 ǫ 2 − m + 1 2 r 2 2  ≥ g ( u )  1 + 93 256 m ( m − 1) r 2 2 − m + 1 2 r 2 2  ≥ g ( u )  1 + 93 256 r 2 2  The last b ound follo ws b ecause the w orst p ossible lo w er b ound o ccurs at m = 3. Hence, there are only a p olynomial n umb er of iterations of this form as w ell. 4.2 Exact momen ts, appro ximate lo cal optima W e are no w ready to extend the analysis of Theorem 3 to the case when we h a v e acce ss to the exact moment tensor, b ut instead of using exact moments, we will u s e Lo calOpt with appropr iately c hosen ǫ 1 . On the o ther hand , using a w eak er lo cal optim u m al gorithm w ill also gi ve us weak er guaran tees o n the qualit y of the output, giving a wea ker f orm of Lemma 4. Ove r R n (instead of S n − 1 ), Lemma 3 giv es us a form ula of the f orm : f m ( u ) = k u V k m  E  ( x T V u 0 V ) m  − γ m  + k u W k m  E  ( x T W u 0 W ) m  − γ m  + γ m k u k m W e will optimise the fu nction g ( u ) = f m ( u ) − γ m k u k m using Lo calOpt o ve r the unit spher e. This is equiv alent to optimising f m and sim p liﬁes our d eriv ativ e calculati ons . Lemma 8 (Exact momen ts, inexact optima) . L et F = F V F W have the same ﬁrst m − 1 moments as a Gaussian but a diﬀer ent m th moment. Supp ose we run L o c alOpt on g ( u ) , starting fr om a p oint u wher e g ( u ) ≥ η , setting ǫ 1 ≤ mη 2 / ( m − 2) / M 2 / ( m − 2) . After p oly( n, 1 /ǫ 1 , η ) iter ations, we wil l have a p o int u ∗ wher e either k π V ( u ∗ ) k ≥ 1 − 16 ǫ 1 or k π W ( u ∗ ) k ≥ 1 − 16 ǫ 1 . Pr o of. W e pr o ceed as in Lemma 4. u ∗ lies on a curve C = { s ( u ∗ V ) 0 + t ( u ∗ W ) 0 : s 2 + t 2 = 1 , s ≥ 0 , t ≥ 0 } . W e will s ho w that neither s nor t is b ounded aw a y from 0 and 1. Restricted to the curve g ( u ) = g ( s, t ) = a v s m + a w t m . S upp ose that a w ≤ 0, then w e must h a v e that s ≥ ( η / M ) 1 /m . In this case, a dir ect calculati on comparin g h D g u , u i = ma v s m − 1 + ma w t m − 1 with k Dg u k = m p a 2 v s 2( m − 1) + a 2 w t 2( m − 1) will yield s ≥ 1 − 2 ǫ 1 . T h u s, we may assume that b oth a v and a w are p ositiv e, and that a v ≥ a w . Supp ose that for a unit v ector u , we hav e s, t ≥ 16 √ ǫ 1 , an d the ﬁrst-order gradien t condition: h D g u , u i k D g u k ≥ 1 − ǫ 1 , 15 then, λ max ( D 2 g u ) ≥ 3 m ( m − 1) g ( u ) 4 (where th e eigen v alue is tak en only in the su b space orthogonal to u ). Thus, the algorithm con tin u es making progress at suc h a vecto r u . T o d o this, w e low er b oun d the top eigen v alue by the quadr atic form in the dir ection − tu 0 V + su 0 W , whic h is orthogo n al to u . λ max ( D 2 g u ) ≥ ( − tu 0 V + su 0 W ) T D 2 g u ( − tu 0 V + su 0 W ) = m ( m − 1)( a v s m − 2 t 2 + a w s 2 t m − 2 ) = m ( m − 1)( a v s m − 1 , a w t m − 1 )  t 2 /s s 2 /t  By co n struction, u has t wo nonzero co ordinates, taking v a lu es s and t and all other co ord inates zero. D g u has partial d eriv ativ es ma v s m − 1 and ma w t m − 1 in these directions. Thus, h D g u , u i k D g u k ≤  a v s m − 1 a w t m − 1  T  s t       a v s m − 1 a w t m − 1      Th u s we obtain the condition that:  a v s m − 1 a w t m − 1  = (1 − ǫ )      a v s m − 1 a w t m − 1       s t  + p 2 ǫ − ǫ 2      a v s m − 1 a w t m − 1      r where 0 ≤ ǫ ≤ ǫ 1 and r is a unit vecto r orthogonal to ( s , t ). Su bstituting this in to the pr evious equation: λ max ( D 2 g u ) ≥ m ( m − 1)  (1 − ǫ )      a v s m − 1 a w t m − 1      + p 2 ǫ − ǫ 2      a v s m − 1 a w t m − 1      r T  t 2 /s s 2 /t  ≥ m ( m − 1)      a v s m − 1 a w t m − 1       (1 − ǫ ) − p 2 ǫ − ǫ 2  1 s + 1 t  ≥ m ( m − 1)      a v s m − 1 a w t m − 1       (1 − ǫ ) − 2 p 2 ǫ − ǫ 2 1 16 √ ǫ  ≥ 3 m ( m − 1) g ( u ) 4 where the last estimate follo ws from the Cauc hy-Sc h wa r tz in equalit y . 4.3 Appro ximate momen ts and appro ximate lo cal optima By using the r obust Sc hw artz Zipp el Lemma (Lemma 12) instead of the usual form, and Lo calOpt at Lines 10 and 11 of FindBasis and Lines 13 and 15 of ExtendBasis , w e can obtain an eﬃcient randomized algorithm. The ma jor diﬃcult y remaining i s that w e m ust b ound the err or incurred ev ery time w e call LocalOpt . The error analysis is tec hnical: the idea is to obtai n appro ximate v ersions of Lemmas 3 and 4 , and to sh o w that Lo calOpt b eha ve s well on these app ro ximate v ersions. Consid er the ﬁrs t iteration: 16 Lemma 9 (Two steps) . L et x have the same ﬁrst m − 1 moments as a Gaussian but a diﬀer ent m th moment. L et u 1 = √ 1 − δ v 1 − √ δ w 1 b e the ve c tor found in the ﬁrst iter ation of FindBasis , wher e v 1 and w 1 ar e unit ve ctors in V and W r esp e ctive ly. Supp ose we r un L o c alOpt on g ( u ) = f m ( u ) − γ m k u k m on the orth o gonal subsp ac e u ⊥ 1 , star ting fr om a p oint u wh er e g ( u ) ≥ η = M δ 1 / 16 , setting ǫ 1 ≤ mη 2 / ( m − 2) M 2 / ( m − 2) − 60 m 2 M 2 δ 5 / 16 as the err or p ar ameter in L o c alOpt . After p oly( n, 1 / ǫ 1 , η ) iter ation s, we wil l have a p oint u ∗ wher e either k π V ( u ∗ ) k ≥ 1 − δ 1 / 8 or k π W ( u ∗ ) k ≥ 1 − δ 1 / 8 Tthe sequence of ideas in this pro of is not unlike the pro ofs in Section 3: ﬁr s t we derive a nice representati on of f m (cf Lemma 3 , then w e analyse the sup p ort of a local optim um u nder this represent ation (cf Lemma 4) – w e are not able to claim th at the local optim um found is con tained wholly in V or W , but since we are satisﬁed with appro ximate lo cal optima, we can b ound the comp onent s around 0 and 1. All through this, w e must b ound the error, and try to pu sh through the calculat ions of Lemma 8 . Pr o of of L emma 9. First, we will co ns truct an orthonormal basis wh ic h includes u 1 : extend { v 1 } and { w 1 } to orthonormal bases { v i } and { w i } of V and W resp ectiv ely . Replace v 1 and w 1 with the follo wing t w o vect ors: u 1 = √ 1 − δ v 1 − √ δ w 1 ˆ u 1 = √ δ v 1 + √ 1 − δ w 1 Th u s our basis will b e { u 1 , ˆ u 1 , v 2 , . . . , v k , w 2 , . . . , w l } . F or a vec tor x = ( x 1 , . . . , x n ) in the basis of { v i } and { w i } , w e no w ha ve : x = ( √ 1 − δ x 1 − √ δ x 2 , √ δ x 1 + √ 1 − δ x 2 , x 3 , . . . , x n ) whic h is simply a rotation (unitary transformation) in the ﬁ rst t w o co ordinates. W e ev a lu ate the m th momen t on the subsp ace orth ogonal to u 1 . Let ξ b e a p oint on this orthogonal subspace: note that ξ has 0 comp onen t in the ﬁrst co ord inate: f m ( ξ ) = E  ( x T ξ ) m  = E √ δ x 1 ξ 2 + √ 1 − δ x 2 ξ 2 + k X i =2 x v i ξ v i + l X i =2 x w i ξ w i ! m ! W e can b r eak up the argument in to tw o d ot pr o ducts, wh ic h are indep endent of eac h other. More- o v er, observ e that the norm of th e tw o constituen t p arts of th e ξ v ector tak en together is still 1. x T ξ = ( x 1 , x v 2 , . . . , x v k ) T ( √ δ ξ 2 , ξ v 2 , . . . , ξ v k ) + ( x 2 , x w 2 , . . . , x w l ) T ( √ 1 − δ ξ 2 , ξ w 2 , . . . , ξ w l ) Hence, w e can apply Lemma 3: this giv es a p er tu rb ed v ersion of L emm a 3. f m ( ξ ) = δ ξ 2 2 + k X i =2 ξ 2 v i ! m/ 2 E  (( x 1 , x v 2 , . . . , x v k ) T ( √ δ ξ 2 , ξ v 2 , . . . , ξ v k ) 0 ) m − γ m  + (1 − δ ) ξ 2 2 + l X i =2 ξ 2 w i ! m/ 2 E  ( x 2 , x w 2 , . . . , x w l ) T ( √ 1 − δ ξ 2 , ξ w 2 , . . . , ξ w l ) 0  m − γ m  + γ m 17 Fixing a p oin t ξ ∗ ∈ u ⊥ 1 ∩ S n − 1 : w e will giv e a curv e C which p asses through this p oin t and remains on th e un it sp here. W e will analyse the v a lu e of the g ( ξ ) = f m ( ξ ) − γ m k ξ k m on this curv e – as b efore, ev ery p oin t w hic h is a lo cal optim um on S n − 1 has to b e a local optim um on C as w ell. Th u s b y studying the lo cal optima o v er C , we w ill b e able to describ e the strucure of the local optima in fu ll space. W e ma y assume that all the ξ ∗ i are n on n egativ e – otherwise we can pic k simp ly negate th e asso ciated basis v ector. W e tak e the follo wing as the components for C : ξ ∗ v = 1 q P k i =2 ( ξ ∗ v i ) 2 (0 , 0 , ξ ∗ v 2 , . . . , ξ ∗ v k , 0 , . . . , 0) ξ ∗ w = 1 q (1 − δ )( ξ ∗ 2 ) 2 + P l i =2 ( ξ ∗ w i ) 2 (0 , √ 1 − δ ξ ∗ 2 , 0 , . . . , 0 , ξ ∗ w 2 , . . . , ξ ∗ v l ) ξ ∗ 1 = (1 , 0 , . . . , 0) Since these are the only three dir ections that c h ange along C , w e will use th ese three v ectors as an orthonormal basis. Now, deﬁning the follo wing quantit y: α = ( ξ ∗ 2 ) 2 / (1 − δ )( ξ ∗ 2 ) 2 + l X i =2 ( ξ ∗ w i ) 2 ! w e can write our cur ve C as: C = { y ξ ∗ v + z ξ ∗ w + √ αδ z ξ ∗ 1 : y 2 + (1 + δ α ) z 2 = 1 , y , z ≥ 0 } Sp eciﬁcally , w e will use the b asis ξ ∗ v and (1 + αδ ) − 1 ( ξ ∗ w + ξ ∗ 1 ). Note that in this basis, y is p recisely the co ord inate along the ﬁrst basis v ector and (1 + δ α ) 1 / 2 z is th e co ord inate along the second basis v ector. Denote this latter quantit y by z ′ , then by the chain rule, w e ha v e that ∂ /∂ z ′ = (1 + δ α ) − 1 / 2 ∂ /∂ z . Restricted to C , the exp ectation terms simplify: note that ( √ 1 − δ ξ 2 , ξ w 2 , . . . , ξ w l ) 0 remains constan t on C , so the seco n d exp ectation term r educes to a constan t, whic h we will denote with a w . The ﬁrs t exp ectation term d o es not remain constan t, b ecause there is an additional comp onent in the direction of v 1 , but this comp onent alwa ys has a small magnitud e. With a c hange of basis, w e can simplify this exp ression to inv olving only y and z : E  (( x 1 , x v 2 , . . . , x v k ) T ( √ δ ξ 2 , ξ v 2 , . . . , ξ v k ) 0 ) m  = E  ( x 1 , x ξ ∗ v ) T ( √ αδ z , y )  m  W e will denote the ﬁrs t exp ectatio n term b y a v . In fu ll, our ob jectiv e fu nction on C is giv en b y: g ( ξ ) = [ δ αz 2 + y 2 ] ( m/ 2) E  ( x 1 , x v ) T ( √ αδ z , y ) 0  m − γ m  + a w z m = [ δ αz 2 + y 2 ] ( m/ 2) a v ( y , z ) + a w z m Next w e will examine the lo cal optima on C : let ξ b e the output of of Lo calOpt : w e w ill sho w that ξ h as large pro jection with either the V or W sub space (cf Lemma 4). W e will a n alyse the follo wing cases: 1. y 2 ≤ δ 1 / 4 or z 2 ≤ δ 1 / 4 . 18 2. y 2 ≥ δ 1 / 4 and z 2 ≥ 1 / 3. 3. z 2 ≥ δ 1 / 4 and y 2 ≥ 1 / 3. Case 1: Su pp ose that y 2 ≤ δ 1 / 4 , then we must hav e z ≥ p 1 − δ 1 / 4 − αδ . Th e appro ximate lo cal optim um u that we compu te h as pro jection at least √ 1 − δ on this lo cal optim um , and hen ce, the pro jection of u onto w is at least: k π W ( u ) k ≥ q (1 − δ )(1 − δ 1 / 4 − αδ ) − √ δ ≥ 1 − δ / 2 − δ 1 / 4 / 2 − αδ − √ δ ≥ 1 − δ 1 / 4 In this case, for suﬃcient ly small δ , we ha ve: k π W ( u ) k 2 ≥ 1 − δ 1 / 8 The argumen t for when z 2 ≤ δ 1 / 4 is iden tical. Case 2: W e will p ro ve that Lo calOpt can not terminate in this region by carrying out the calculatio n s of L emma 8 whilst k eeping trac k o f errors. Th u s, let ξ b e a p oin t in this range, w e will s ho w that if the ﬁrst deriv at ive condition in Lo calOpt is satisﬁed, then the seco n d deriv ativ e condition is uns atisﬁed, thus Lo calOpt can not termin ate at ξ . First, let us examine h o w f c hanges o v er C : Claim 10 ( First partials und er p erturb ations) . In the r ange wher e y 2 ≥ δ 1 / 4 and z 2 ≥ 1 / 3 ,     ∂ g ∂ y − ma v y m − 1     ≤ 3 mM √ δ     ∂ g ∂ z − ma w z m − 1     ≤ 4 mM √ δ As a corollary , via the triangle inequalit y , we hav e that: k ( g y , g z ) k ≥ m   ( a v y m − 1 , a w z m − 1 )   − 5 mM √ δ Claim 11 ( S econd partials u nder p erturbations) . In the r ange wher e y 2 ≥ δ 1 / 4 and z 2 ≥ 1 / 3 :     ∂ 2 g ∂ y 2 − m ( m − 1) a v y m − 2     ≤ c vv m 2 M √ δ     ∂ 2 g ∂ z 2 − m ( m − 1) a w z m − 2     ≤ c w w m 2 M √ δ     ∂ 2 g ∂ y ∂ z     ≤ c vw m 2 M √ δ wher e c vv , c vw and c w w ar e absolute c onstants b ounde d by 20. Throughout the rest of this calculation, w e will use the basis of n − 1 v ectors consisting of { ξ ∗ v , (1 + αδ ) − 1 ( ξ ∗ w + ξ ∗ 1 ) } , and an y orthonormal extension to u ⊥ 1 . In particular, in this basis, ξ = ( y , z ′ , 0 , . . . , 0). 19 As b efore, we w ill low er b ound th e contribution of the second deriv ativ e term. Our direction of mo v ement will b e ( − z ′ , y , 0 , . . . , 0). This vec tor is clearly a unit v ector orthogonal to ξ . where top eigen v alue is tak en orthogonal to ξ . λ max ( D 2 g ξ ) ≥ ( − z ′ , y ) D 2 g ξ  − z ′ y  ≥ ( − √ 1 + αδ z , y )  g y y g z ′ y g z ′ y g z ′ z ′   − √ 1 + δ αz y  W e can f urther use C laim 11 to simplify th e other comp onen ts of the quadratic form: λ max ( D 2 g ξ ) ≥ (1 + δ α ) z 2 g y y + y 2 g z ′ z ′ − 2 c vw m 2 M √ δ ≥ (1 + δ α ) m ( m − 1)( a v y m − 1 , a w z m − 1 )  z 2 /y y 2 /z  − (1 + δ α )( c z z + c y y + 2 c z w ) m 2 M √ δ Our ﬁrst deriv a tive co n dition is gi ven b y: h D g ξ , ξ i k D g ξ k ≥ 1 − ǫ 1 Since ξ = ( y , z ′ , 0 , . . . , 0) h as only t wo n onzero comp on ents, w e need only ev aluate tw o comp onents of the deriv ativ e: furtherm ore, we can lo w er b ound the norm k D g ξ k ≥ k ( g y , g z ′ ) k , which gives the follo wing lo wer b ound : ( g y , g z ′ )  y z ′  k ( g y , g z ′ ) k ≥ 1 − ǫ 1 Rearranging, and ap p lying Claim 10 yields: m ( a v y m − 1 , a w z m − 1 )  y z  ≥ (1 − ǫ 1 ) k ( g y , g z ′ ) k − 7 mM √ δ ≥ m (1 − ǫ 1 )   ( a v y m − 1 , a w z m − 1 )   − 12 mM √ δ ≥ m (1 − ǫ 1 − 12 M √ δ η )   ( a v y m − 1 , a w z m − 1 )   Th u s, w e ca n rewrite this relationship for u nit v ector r orthogonal to ( a v y m − 1 , a w z m − 1 ) and 0 ≤ ǫ ≤ ǫ 1 + 12 M √ δ η :  a v y m − 1 a w z m − 1  = (1 − ǫ )   ( a v y m − 1 , a w z m − 1 )    y z  + p 2 ǫ − ǫ 2   ( a v y m − 1 , a w z m − 1 )   r Substituting this bac k in to our lo w er b ound f or λ max yields: λ max ≥ (1 + δ α )(1 − ǫ )   ( a v y m − 1 , a w z m − 1 )   ( z 2 + y 2 ) − p 2 ǫ − ǫ 2   ( a v y m − 1 , a w z m − 1 )    1 y + 1 z  − 80 m 2 M √ δ ≥ (1 + δ α )(1 − δ 1 / 6 ) m ( m − 1) f ( ξ ) − √ 2 δ 1 / 24 − 80 m 2 M √ δ ≥ 3 4 m ( m − 1) f ( ξ ) 20 where w e used the Cau ch y-Sc hw artz inequalit y for:   ( a v y m − 1 , a w z m − 1)   ≥ a v y m + a w z m ≥ g ( ξ ) − mM √ αδ Case 3: Th is case follo ws fr om the exactly the same analysis a s ab o ve. It i s i n fac t substan tially easier, as the denominator terms αδz + y are in fact all b oun ded by constants n o w, and hence the n u m erator is small enough in almost all cases ab o ve to b ound th e terms. W e no w provide the proofs for the clai ms regarding the coeﬃcien ts a v and a w . In explicitly taking deriv at ives, it is imp ortan t to note the follo wing: a v = E  ( x 1 , x v ) T ( √ αδ z , y ) 0  m  − γ m = 1 ( αδ z 2 + y 2 ) ( m/ 2) E  ( x 1 , x v ) T ( √ αδ z , y )  m  − γ m F or ease of notation, denote φ = ( √ αδ z , y ), we will supp r ess all but one φ argumen t in our moment tensors, th us we will write A ( φ ) instead of A ( φ, . . . , φ ), and A ( φ, e 1 ) instead of A ( φ, . . . , φ, e 1 ). If A is a m th order tensor, its deriv ativ e has comp onents giv en by ( D ˆ A φ ) i = mA ( φ, e i ) where A tak es ( m − 1) copies of φ . W e also hav e the Hessian D 2 : ( D 2 A φ ) ij = m ( m − 1) A ( φ, e i , e j ). W e can b oun d the sp ectral norm of D 2 A u sing Claim 6, wh ic h yields λ max ( D 2 A ) ≤ m ( m − 1) M . Pr o of of Claim 10. Firstly , w e hav e: ∂ g ∂ y = my ( δ αz 2 + y 2 ) ( m/ 2) − 1 a v + ( δ αz 2 + y 2 ) ( m/ 2) ∂ a v ∂ y ∂ g ∂ z = mz m − 1 a w + mαδ z ( δ αz 2 + y 2 ) ( m/ 2) − 1 a v + ( δ αz 2 + y 2 ) ( m/ 2) ∂ a v ∂ z The mαδ z ( δ αz 2 + y 2 ) ( m/ 2) − 1 a v is up p er b ounded in absolute v a lu e in mM δ . Similarly , it is also clear that:    my ( δ αz 2 + y 2 ) ( m/ 2) − 1 a v − ma v y m − 1    ≤ mM √ δ Th u s it remains to sho w that the p artial deriv a tive terms are small: ( δ αz 2 + y 2 ) ( m/ 2) ∂ a v ∂ y = ( δ αz 2 + y 2 ) ( m/ 2)  − my ( δ αz 2 + y 2 ) ( m/ 2)+1 A ( φ, φ ) + m ( δ αz 2 + y 2 ) ( m/ 2) A ( φ, e 1 )  = m − y √ αδ z A ( φ, e 2 ) − y 2 A ( φ, e 1 ) αδ z 2 + y 2 + A ( φ, e 1 ) ! = m − y √ αδ z A ( φ, e 2 ) + αδ z 2 A ( φ, e 1 ) αδ z 2 + y 2 When w e ha v e a term lik e A ( φ, . . . , φ, e 1 ), the argumen ts are n ot normalised. In particular: A ( φ, . . . , φ, e 1 ) = ( δ αz 2 + y 2 ) ( m − 1) / 2 A ( φ 0 , . . . , φ 0 , e 1 ) Th u s, normalising giv es:     ( δ αz 2 + y 2 ) ( m/ 2) ∂ a v ∂ y     ≤ mM √ δ + mδ M ≤ 2 mM √ δ 21 F or the other partial deriv ativ e, we wan t to compu te: ( δ αz 2 + y 2 ) ( m/ 2) ∂ a v ∂ z = ( δ αz 2 + y 2 ) ( m/ 2) " − mαδ z ( αδ z 2 + y 2 ) ( m/ 2)+1 A ( φ, φ ) + m √ αδ ( αδ z 2 + y 2 ) m/ 2 A ( φ, e 2 ) # = m √ αδ − √ αδ z A ( φ, φ ) + αδ z 2 A ( φ, e 2 ) + y 2 A ( φ, e 2 ) αδ z 2 + y 2 ! Applying the same metho d:     ( δ αz 2 + y 2 ) ( m/ 2) ∂ a v ∂ z     ≤ 3 mM √ δ Hence com bining this with our earlier b ound, we hav e the desired inequ alit y . Pr o of of Claim 11. By direct ca lculation, w e obtain: ∂ 2 g ∂ y 2 = ( δ αz 2 + y 2 ) ( m/ 2) ∂ 2 a v ∂ y 2 + 2 my ( δ αz 2 + y 2 ) ( m/ 2) − 1 ∂ a v ∂ y + ma v ( δ αz 2 + y 2 ) ( m/ 2) − 2 ( δ αz 2 + ( m − 1) y 2 ) W e no w estimate the three terms in this sum – the ﬁrst tw o terms will b e of order √ δ , and the last term will give us app r o ximately m ( m − 1) a v y m − 2 . ( δ αz 2 + y 2 ) ( m/ 2) ∂ 2 a v ∂ y 2 = ( δ αz 2 + y 2 ) ( m/ 2) ( − m 2 y ( αδ z 2 + y 2 ) ( m/ 2)+1 " − y √ αδ z A ( φ, e 2 ) αδ z 2 + y 2 + αδ z 2 A ( φ, e 1 ) αδ z 2 + y 2 # + m ( αδ z 2 + y 2 ) m/ 2 " − √ αδ z A ( φ, e 2 ) αδ z 2 + y 2 + − ( m − 1) y √ αδ z A ( e 2 , e 1 ) αδ z 2 + y 2 + y 2 √ αδ z A ( φ, e 2 ) ( αδ z 2 + y 2 ) 2 + αδ z 2 ( m − 1) A ( e 1 , e 1 ) αδ z 2 + y 2 + − 2 y αδ z 2 A ( φ, e 1 ) ( αδ z 2 + y 2 ) 2 #) = ( − m 2 y ) " − y √ αδ z A ( φ, e 2 ) ( αδ z 2 + y 2 ) 2 + αδ z 2 A ( φ, e 1 ) ( αδ z 2 + y 2 ) 2 # + m " − √ αδ z A ( φ, e 2 ) αδ z 2 + y 2 + − ( m − 1) y √ αδ z A ( e 2 , e 1 ) αδ z 2 + y 2 + y 2 √ αδ z A ( φ, e 2 ) ( αδ z 2 + y 2 ) 2 + αδ z 2 ( m − 1) A ( e 1 , e 1 ) αδ z 2 + y 2 + − 2 y αδ z 2 A ( φ, e 1 ) ( αδ z 2 + y 2 ) 2 # W e will b ou n d the magnitude of ev ery term in th is sum. Con s ider the ﬁr st term of the form:      ( − m 2 y ) − y √ αδ z A ( φ, e 2 ) ( αδ z 2 + y 2 ) 2      ≤ m 2      y 2 √ δ A ( φ, e 2 ) ( αδ z 2 + y 2 ) 2      Th u s, since m ≥ 3:      ( − m 2 y ) − y √ αδ z A ( φ, e 2 ) ( αδ z 2 + y 2 ) 2      ≤ 3 m 2 M      y 2 √ δ αδ z 2 + y 2      No w, y 2 / ( αδz 2 + y 2 ) ≤ 1, h ence:      ( − m 2 y ) − y √ αδ z A ( φ, e 2 ) ( αδ z 2 + y 2 ) 2      ≤ 3 m 2 M √ δ 22 Of the seven terms in the su m, the ﬁrst, third and ﬁ fth terms can b e analysed exactly a s ab o ve, and their sum can b e up p er b oun ded b y 15 m 2 M √ δ . F or the r emainin g terms we ha ve to use our lo w er b ound on y , for example:     ( − my ) αδ z 2 A ( φ, e 1 ) ( αδ z 2 + y 2 ) 2     ≤ mM     δ y αδ z 2 + y 2     ≤ mM     δ y     ≤ mM δ 7 / 8 By this reasoning, w e can b oun d all seven terms by m 2 M √ δ , hence this term in ∂ 2 g /∂ y 2 con tributes is b ound ed in absolute v alue by 7 m 2 M √ δ . F or th e second term in that expression, the analysis is almost iden tical to the p r evious cla im and gives     2 my ( δ αz 2 + y 2 ) ( m/ 2) − 1 ∂ a v ∂ y     = 2 m 2      y ( δ αz 2 + y 2 ) ( m/ 2) − 1 ( − y √ αδ z A ( φ 0 , e 2 ) + αδ z 2 A ( φ 0 , e 1 )) ( δ αz 2 + y 2 ) 3 / 2      ≤ 2 m 2      y ( − y √ αδ z A ( φ 0 , e 2 ) + αδ z 2 A ( φ 0 , e 1 )) ( δ αz 2 + y 2 )      ≤ 2 m 2 M √ δ + 2 m 2 M     δ y     ≤ 4 m 2 M √ δ Th u s, w e ha v e:     ∂ 2 g ∂ y 2 − ma v ( δ αz 2 + y 2 ) ( m/ 2) − 2 ( δ αz 2 + ( m − 1) y 2 )     ≤ 19 m 2 M √ δ By applying the triangle inequalit y:    ma v ( δ αz 2 + y 2 ) ( m/ 2) − 2 ( δ αz 2 + ( m − 1) y 2 ) − m ( m − 1) a v y m − 2    ≤ m 2 M √ δ Th u s w e ha ve the desired result for the second partial with resp ect to y . Th e other second deriv ativ es are computed in a similar w a y . Using the ab o ve, we are no w examine what happ ens after t ite r ations of FindBasis . The follo wing th eorem sho ws that after k iterations of FindBasis , our err or blo ws u p at most doubly exp onentia lly in k . Th e pro of holds for ExtendBasis is as well. Theorem 6 (Multiple iterations) . Supp ose FindBasis ﬁnds j ≤ k ortho gonal ve ctors { u 1 , . . . , u j } of g ( u ) taking ǫ 1 such that η ≤ M ǫ 1 / 16 j 1 for e ach c al l of L o c a lOpt , then k π V ( u j ) k 2 ≥ 1 − ǫ (1 / 16) j 1 . Pr o of of The or em 6. After t iterations, we h a v e a basis of orthonormal v ectors { u 1 , . . . , u t } where eac h u i is clo se to s ome v ector in V : u 1 = a 11 v 1 + b 11 w 1 u 2 = a 21 v 1 + a 22 v 2 + b 21 w 1 + b 22 w 2 . . . . . . u t = a t 1 v 1 + · · · + a tt v t + b t 1 w 1 + · · · b tt w t 23 W e u se the orthonormal basis { u i } , { v t +1 , . . . , v k } , th e remainin g v ectors in W { w t +1 , . . . , w n − k } , and appro ximate copies of { w 1 , . . . , w t } . This last set is giv en b y: w ′ 1 = c 1 w 1 + t X i =1 d 1 i v i + t X i =1 e 1 i w i . . . . . . w ′ t = c t w t + t X i =1 d ti v i + t X i =1 e ti w i In these sums w e hav e d ii = e ii = 0, and we ha v e orthonormality b et ween these v ectors. Consider the inner p ro du ct x T ξ , where ξ is of unit length and lies in the sp ace orthogo n al to { u i } : x T ξ = k X i = t +1 x v i ξ v i + n − k X i = t +1 x w i ξ w i + t X i =1 ξ w ′ i ( c i x w i + i X j =1 d ij x v j + i X j =1 e ij x w j ) = ( x v 1 , . . . , x v t , x v t +1 , . . . , x v k ) T ( t X i =1 ξ w ′ i d i 1 , . . . , t X i =1 ξ w ′ i d it , ξ v t +1 , . . . , ξ v k )+ + ( x w 1 , . . . , x w t , x w t +1 , . . . , x w n − k ) T ( ξ w ′ 1 c 1 + t X i =1 ξ ′ w i e i 1 , ξ w ′ t c t + t X i =1 ξ w ′ i e it , ξ w t +1 , . . . , ξ w n − k ) The t w o vec tors formed from ξ ha ve tot al n orm 1. Now, we can apply Lemma 3, to obtain: f m ( ξ ′ ) =   k X j = t +1 ξ 2 v j + t X j =1 t X i =1 ξ w ′ i d ij ! 2   m/ 2 a v +   n − k X j = t +1 ξ 2 w j + t X j =1 ξ w ′ j c j + t X i =1 ξ w ′ i e ij ! 2   m/ 2 a w + γ m where the exp ectation term a v is giv en b y: a v = E " ( x v 1 , . . . , x v t , x v t +1 , . . . , x v k ) T ( t X i =1 ξ w ′ i d i 1 , . . . , t X i =1 ξ w ′ i d it , ξ v t +1 , . . . , ξ v k ) 0 # m − γ m ! (and similarly for a w ). As in the single iteration case, w e r estrict to a cur v e. Fix an output ξ ∗ of FindBasis : w e will ﬁx the ratio of the comp onents { ξ w j } in the ratio of ξ ∗ , an d similarly , we will ﬁx the ratios of { ξ w ′ 1 , . . . , ξ w ′ t , ξ w t +1 , . . . , ξ w n − k } according to ξ ∗ as w ell. This giv es the follo wing restriction on our curve after subtr acting γ m k ξ k m . g ( ξ ′ ) = a v   ( y ′ ) 2 + ( z ′ ) 2   t X j =1 t X i =1 d ij ξ ∗ w ′ i /l ! 2     m/ 2 + a w ( z ′ ) m where l is a constant given by: l = 1  P n − k j = t +1 ( ξ ∗ w j ) 2 + P t j =1  ξ ∗ w ′ j c j + P t i =1 ξ ∗ w ′ i e ij  2  24 The co eﬃcien t of z ′ 2 is b ounded b y at most 2 t ( ǫ 1 / 16 1 ) t , hence using th e previous lemma for a single iteration, the output p ro du ced h ere is a ( t + 1) th v ector u t +1 suc h that: h u t +1 , u ∗ i ≥ 1 −  2 t ( ǫ 1 / 16 1 ) t  1 / 8 ≥ 1 − ( ǫ 1 / 16 1 ) t +1 for suﬃcien tly small ǫ 1 (relativ e to k ). 4.4 Algorithms Using Lo calOpt , we hav e an algorithm for factoring (Problem 1). T o deal with the errors intro- duced by sampling and appro ximate local optima, we replace the Sc hw artz-Zipp el step in Find- Basis with the rob u st v ersion in Lemma 12, where we set the error parameter of th e robus t Sc hw artz-Zipp el lemma to b e ( η − k M m k 2 ǫ ) /n m . W e will use the follo wing r obust v ersion of the Sc hw artz-Zipp el ident ity test. Lemma 12 (Robu st Sc hw artz-Zipp el) . L et p b e a de gr e e m p ol ynomial over n variables and K a c onvex b o dy in R n . If ther e exists x ∈ K such that | p ( x ) | > ǫ (2 cn ) m , then for l r and om p oints s i , Pr ( ∀ s i : | p ( s i ) | ≤ ǫ ) ≤ 2 − l . 4.5 Robust Sc h wartz-Zippel lemma Pr o of of L emma 12. Let µ denote th e uniform measure o ve r K , b y Corollary 2 of Carb ery and W righ t [12]: max x ∈ K | p ( x ) | 1 /m ǫ − 1 /m µ ( { x ∈ K : | p ( x ) | ≤ ǫ } ) ≤ cn Consider our l samples – there are t wo p ossibilities: 1. µ ( { x ∈ K : | p ( x ) | ≤ ǫ } ) ≥ 1 / 2. In this case, we hav e | p ( x ) | ≤ ǫ (2 cn ) from the b ound ab o v e. 2. µ ( { x ∈ K : | p ( x ) | ≤ ǫ } ) ≤ 1 / 2. Then, Pr ( ∀ i | p ( x i ) | ≤ ǫ ) ≤ 1 / 2 l . W e can of course amplify this probabilit y b y rep eating the test (or simply taking l larger). Algorithm 5 F actorUnderGaussian Input: Highest momen t m , distribution F . 1: B ← F indB asis ( m, F ) . 2: U ← E xtendB asis ( m, F , B , φ ) . 3: return U Pr o of of The or em 1. W e c ho ose ǫ 1 (the ﬁrst step lo cal iteration) to b e: ( ǫ 1 ) ( 1 16 ) k ≤ min { ǫ, η − k M m k 2 ǫ } 25 where k M m k 2 is the 2-norm (sp ectral norm) of the m th momen t tensor. W e tak e enough samples so that eac h estimated moment in W is within min ( ǫ 1 , η − k M m k 2 ) /n m ) of the Gaussian moment, and ev ery momen t in V is oﬀ by at most min( ǫ 1 / 2 , η / 2). In particular, note that all sampled gradients and Hessian matrices tak e a v alue w hic h diﬀer b y no more th an ǫ 1 / 2 from their true v alues. Thus, w e can simply absorb this as part of the error arising from lo cal searc h. Also, this giv es us an upp er b ound on sample complexit y – the n umb er of samples it tak es to estimate the m th momen ts of a Gaussian distribution to accuracy ǫ in R n is giv en as C m ǫ − 2 n m/ 2 log n [22], whic h wh en ev a luated b ecomes n O ( m ) . A t eac h iterat ion of the algorithm, w e run th e Robust Sc hw artz-Zipp el test log ( k /δ ) times with Sc hw artz-Zipp el parameter η − k M m k 2 ǫ . With pr obabilit y at least 1 − δ , either eac h iteration pro du ces a u , where   E  ( x T u ) m  − γ m   ≥ η − k M m k 2 ǫ or we correctly deduce that there are n o more directions w hose moments d iﬀer from a Ga u ssian by more than ( η − k M m k 2 ǫ ) /n m . In the latter case, by momen t distinguish abilit y , ev ery ve ctor in P , the minimally r elev an t sub s pace, has large pro j ection on the basis { u i } . In the former case, w e know that eve r y unit v ector in { u i } ⊥ with pro jection at least 1 − ǫ tak es v alue w hic h is b ound ed a w ay from γ m b y at lea st η − k M m k 2 ǫ , th us ev ery such vec tor is still momen t distinguishable. Ap plying Theorem 6 then, w e sequ en tially generate a sequence of at most k orthogonal u i suc h that: |h u i , π V ( u i ) i| ≥ 1 − ( ǫ 1 ) (1 / 16) i W e need to sho w that in addition d m ( F , ˆ F U ˆ F U ⊥ ) ≤ ǫ . Let F ′ = ˆ F U ˆ F U ⊥ : the momen t-distance b et ween the true and sampled distributions diﬀer by at mo st ǫ 1 , it suﬃces for us to pro v e that d m ( F , F ′ ) ≤ ǫ . T o this end, w e w ill apply the repr esentati on formula to F ′ for some ﬁxed unit v ector u . As b efore, w e ha v e: E F ′  ( x T u ) m  = E F ′  ( x T u U ) m )  + E F ′  ( x T u U ⊥ ) m  − γ m k u U k m − γ m k u U ⊥ k m + γ m = E F  ( x T u U ) m )  + E F  ( x T u U ⊥ ) m  − γ m k u U k m − γ m k u U ⊥ k m + γ m Hence, comparing w ith a similar expression for E F  ( x T u ) m  :   E F ′  ( x T u ) m  − E F  ( x T u ) m    ≤   E F  ( x T u U ) m )  − E F  ( x T u V ) m )    + +   E F ′  ( x T u U ⊥ ) m  − E F  ( x T u V ⊥ ) m    + | γ m k u U k m − k u V k m | + γ m |k u U k m − k u U ⊥ k| No w, viewing E F  ( x T u ) m  as the tensor app lied to u , w e see th at w e can b ound these terms by the tensor s p ectral norm:   E F ′  ( x T u ) m  − E F  ( x T u ) m    ≤ ( k M m k 2 + mγ m ) k u U − u V k + ( k M m k 2 + mγ m ) k u U ⊥ − u V ⊥ k By c hoice of U , w e h a v e k u U − u V k ≤ ǫ , and similarly for th e othogo n al comp onen t, th u s w e ha ve our b ound. W e n o w apply these metho d s to learning the concept class H (Problem 2). After app lying an isotropic transform ation, F w ill ha ve Gaussian moments in eve r y direction orthogonal to V , and hence the output basis of FindBasis and ExtendBasis return s only vec tors in the V subspace. The p r o of of this algorithm is s traigh tforwa rd giv en the proof of the facto r ing algo rith m under Gaussian noise and our r ob u stness assumptions. W e will u s e the follo wing p rop osition on robu st learnabilit y (see e.g., [2]). 26 Algorithm 6 LearnUnderGaussian Input: Highest momen t m , distribution F . 1: B 1 ← F indB a sis ( m, F ) . 2: B 2 ← F indB a sis ( m, F + ) on the space orthogonal to B 1 . 3: Alternately run E xt endB a sis on F and F + to ﬁnd a basis U of size at most k . Extend this to a basis of dimension k . 4: Dra w suﬃcien t samples S to learn H on this k dimensional subspace. Pro ject S to span ( U ). 5: Learn H o v er U . Prop osition 7 (V C dimension) . L et H b e a hyp oth e si s class with VC dimension d . L et ℓ ∈ H b e a subsp ac e junta with r elevant subsp ac e V , wher e dim ( V ) = k . L et U b e a k d imensional sub sp ac e wher e ℓ ( π U ) lab els a 1 − ǫ fr action of p oints c orr e ctly. Then we c an le arn ℓ with sample c omplexity (1 /ǫ ) c 2 d log(1 /ǫ )+ c 2 log(2 /δ ) with pr ob ability at 1 − δ . Pr o of of The or em 2. H is r obust ; by assump tion there exists ǫ ′ whic h is p olynomial in ǫ su c h that ǫ ′ + g ( ǫ ′ ) ≤ ǫ/ 2. W e will tak e this ǫ ′ and w ill use the follo win g ǫ 1 for all our calls to Lo calOpt : ( ǫ 1 ) ( 1 16 ) k ≤ min { ǫ ′ , η − k M m k 2 ǫ } Under these parameters, the pro of for th e factoring steps of Lines 1-3 are as in F actorUnder- Gaussian . Thus w ith pr obabilit y at least 1 − δ we will o b tain an orthonormal basis { u i } where |h u i , π V ( u i ) i| ≥ 1 − ( ǫ 1 ) (1 / 16) i . By m omen t learnabilit y , the set of { u i } d isco v ered is appro ximately a basis for P , the minimal dimension relev an t subsp ace. By our c hoice of ǫ 1 ab o ve, we satisfy the robustn ess condition, i.e ., ǫ 16 k 1 ≤ ǫ ′ , in whic h case only ǫ/ 2 fract ion of the p oin ts are mislab eled o v er sp an ( { u i } ). Finally , w e allo w the remaining ǫ/ 2 error to the learning algorithm, to obtain an outpu t h yp othesis whic h correctly lab els 1 − ǫ f r action of F . 5 Momen t distance In our a lgorithms, w e terminate if all remaining directions are Gaussian in the m th momen t (for some ﬁ xed m ). W e w ould like a guarantee that when w e do this, that the rand om v ariable is in fact v ery close to b eing Gaussian. What follo ws is a s et of results wh ic h qu an tify this idea. W e ﬁr st restrict ourselv es to one random v ariable to in tro du ce the analytic tools we need. In what follo ws, w e use the f ollo w in g normalisation for our fourier transf orms in R n : ˆ f ( ξ ) = Z e iξ · x f ( x ) dx This implies th at the P arsev al/ Plancherel theorem tak es the follo wing form: Z | f ( x ) | 2 dx = 1 (2 π ) n Z    ˆ f ( ξ )    2 dξ for f ∈ L 2 ( R n ). The core of the pro of is the follo wing s tatement , whose pro of emplo ys F ourier analytic tec h- niques. W e need the follo wing standard theorem on charac teristic fu nctions (see f or example [38]): 27 Theorem 8 (Characteristic fu nctions) . L et ξ b e a r andom v ariable with distribution function F = F ( x ) and φ ( t ) = E  e itξ  its char acteristic fu nc tion. L et E ( | ξ | n ) < ∞ for some n ≥ 1 , then φ ( r ) exists for al l r ≤ n and φ ( r ) ( t ) = Z ( ix ) r e itx dF ( x ) Mor e o ver, we have an expr ession for the derivatives at 0: E ( ξ r ) = φ ( r ) (0) i r And ﬁnal ly we have the fol lowing T aylor series estimate with e rr or: φ ( t ) = n X r =0 ( it ) r r ! E ( ξ r ) + ( it ) n n ! ǫ n ( t ) wher e the err o r term ǫ n ( t ) → 0 as n → ∞ and is b ounde d: | ǫ n ( t ) | ≤ 3 E ( | ξ | n ) No w: Lemma 13 ( L 2 distance from a Gaussian) . L e t f ∈ L 2 ( R ) b e a pr ob ability density over R whose ﬁrst m moments match those of a standar d Gaussian (whose pr ob a bility density we wil l denote g ). Supp o se that the F ourier tr ansform ˆ f satisiﬁes a tail b ound that    ˆ f ( ξ )    < c/ | ξ | for some c > 0 , then: Z R | f ( x ) − g ( x ) | 2 dx ≤ c ′ m 1 / 8 Pr o of. W e will assume for th e sak e of simplicit y that m is ev en. By Parsev al’s formula, w e ha v e: Z | f ( x ) − g ( x ) | 2 dx = 1 √ 2 π Z | ( f − g ) ˆ ( ξ ) | 2 dξ Both f and g h a v e tail b ounds : f by hyp othesis, and g b ecause th e F ourier transform of a Gaussian is still a Gaussian. Thus if w e truncate the tai ls in an interv al [ − L, L ] w here L = m 1 / 8 : Z R / [ − L,L ]    ˆ f ( ξ )    2 dξ ≤ 2 Z ∞ L 1 ξ 2 dξ ≤ 4 L The F ourier transform of a Gaussian is a Gaussian, and b y app lyin g a standard Gaussian tail b ound [18]: 1 √ 2 π Z ∞ x e − t 2 / 2 dt ≤  1 x  e − x 2 / 2 √ 2 π 28 W e can th en com bine these estimate s using the triangle inequalit y: Z R / [ − L,L ]    ˆ f − g ( ξ )    2 dξ ≤ Z R / [ − L,L ]    ˆ f ( ξ )    2 + | ˆ g ( ξ ) | 2 dξ ≤ 6 L In the interv al [ − L, L ], we no w apply Theorem 8: ( ˆ f − ˆ g )( ξ ) = m X k =0 E f  x k  − E g  x k  k ! ( iξ ) k + ( ǫ f ( t ) − ǫ g ( t )) ( iξ ) m m ! = ( ǫ f ( t ) − ǫ g ( t )) ( iξ ) m m ! No w w e can b oun d the in tegral: Z L − L | ( f − g ) ˆ ( ξ ) | 2 dξ ≤ Z L − L     ( ǫ f ( ξ ) − ǫ g ( ξ )) ( iξ ) m m !     2 dξ ≤ 6  E ( x m ) m !  2 Z L − L t 2 m dt ≤ 6 (2 m/ 2 ( m/ 2)!) 2 2 L 2 m +1 2 m + 1 ≤ 12 2 m + 1 exp  (2 m + 1) log ( L ) − m log(2) − m log  m 2  + m  ≤ c m e − m W e can also give an approximate ve rs ion of this theorem: Lemma 14 (Appr o ximate moment s) . Fix ǫ > 0 , let f ∈ L 2 ( R ) b e a pr ob ability density over R whose ﬁrst m moments satisfy:    E f  x k  − γ k    ≤ ǫ Supp o se that the F o urier tr ansfor m ˆ f satisiﬁes a tail b o u nd that    ˆ f ( ξ )    < c/ | ξ | for some c > 0 , then (for a standa r d G aussian g ); Z R | f ( x ) − g ( x ) | 2 dx ≤ c ′ m 1 / 8 + c ′′ m 2 ǫ 2 e m Pr o of. W e pro ceed as in the pr evious lemma. I t suﬃces for us to b ound the in tegral ov er the in terv al [ − L, L ]. W e apply Cauc hy-Sc h warz for a term wise estimate. Z L − L      m X k =0 E f  x k  − E g  x k  k ! ( iξ ) k + ( ǫ f ( t ) − ǫ g ( t )) ( iξ ) m m !      2 dξ ≤ m Z L − L m X k =0 E f  x k  − E g  x k  k ! ξ k ! 2 +  ( ǫ f ( t ) − ǫ g ( t )) ξ m m !  2 dξ 29 W e can now partition the momen ts into p o wers of 2, s o consider the momen ts where k ∈ [ m/ 2 i +2 , m/ 2 i ]: the in tegral of eac h con tributing term is n o w: Z L − L E f  x k  − E g  x k  k ! ξ k ! 2 dξ = 2( E f  x k  − E g  x k  ) 2 L 2 k + 1 (2 k + 1) k ! ≤ 2  E f  x k  − E g  x k  2 exp  2 k + 1 8 log( m ) − k log k + k  ≤ 2  E f  x k  − E g  x k  2 exp  ( m/ 2 i ) + 2 4 log( m ) − m 2 i +2 log  m 2 i +2  + m 2 i  ≤ 2  E f  x k  − E g  x k  2 exp  m 2 i − 1  Both of our lemmas so far in this section use a tail b ound for the F ourier transform. One w ay to obtain s u c h a tail-boun d is to examine logco n ca v e pr ob ab ility d en sities: Lemma 15 (Log-conca v e d ensities) . L et f : R → R b e a lo gc onc a ve d ensity which is isotr opic and diﬀer entiable, then    ˆ f ( ξ )    ≤ 2 / | ξ | . Pr o of. W e start b y b ounding the magnitude of the F ourier transform by the inte gral of the d eriv a- tiv e. ˆ f ( ξ ) = Z R e iξ x f ( x ) dx = Z R 1 iξ d dx e iξ x f ( x ) dx = Z R 1 iξ e iξ x d f ( x ) dx dx where the third line follo ws by in tegration b y p arts and noting that in the limit f ( x ) → 0 as x → ±∞ . This allo ws us to b ound ˆ f ( ξ ):    ˆ f ( ξ )    ≤ 1 | ξ | Z R   f ′ ( x )   dx Let us no w turn to logc oncav e d ensities. Since f is logco n cav e, we can wr ite it as f ( x ) = e h ( x ) where h is conca ve. Because f is a prob ab ility densit y , we must ha ve h ( x ) → − ∞ as x → ± ∞ , in whic h case since h is conca ve there exists a u n ique in terv al [ a, b ] where h ( x ) tak es a maxim um. This fully determin tes the sign of the deriv ativ e: h ′ ( x ) = 0 in this in terv al h ′ ( x ) < 0 for x < a and h ′ ( x ) > 0 for x > b . T he same signs pattern holds for f ′ , as multiplicat ion b y e − h ( x ) do es not change the sign. W e can no w compute the in tegral b y ap p lying the fundament al t h eorem of calculus: Z R   f ′ ( x )   dx = Z a −∞ f ′ ( x ) dx + Z b a f ′ ( x ) dx + Z ∞ b − f ′ ( x ) dx = lim t →∞ ( f ( a ) − f ( − t )) + ( f ( b ) − f ( a )) + ( − f ( t ) + f ( b )) = f ( a ) + f ( b ) = 2 f ( a ) 30 W e no w apply the follo wing lemma [31], which yields the desired result. Lemma 16 (Upp er b ound on logconca v e functions) . L et f b e an iso tr opic lo gc onc ave density in one dimension, then | f ( x ) | ≤ 1 . Then as a corollary to Lemma 13: Corollary 17 ( L 2 distance for log conca ve den s ities) . L et f : R → R b e an isotr op ic lo gc onc ave density whose ﬁrst m moments match a Gaussian g , then: k f − g k ≤ c m 1 / 8 Pr o of. First, consider th e case when f ( x ) is diﬀeren tiable. W e already know that f ∈ L 1 ( R ); sin ce f ( x ) is b ounded by 1 (Lemma 16), then w e h a v e that f ( x ) ∈ L 2 ( R ) b ecause f ( x ) 2 ≤ | f ( x ) | . W e can no w apply Theorem 13 with the tail b ound guaran teed b y Lemma 15. F or the case when f ( x ) is not diﬀerentia ble, we can p erturb b y a sm all Gaussian random v ariable: let X ∼ f , and let Z ∼ N (0 , 1) b e an indep end en t normal v ariable. Fix a parameter τ ∈ [0 , 1]: Y τ = (1 − τ ) X + p 2 τ + τ 2 Z is isotropic. Moreo v er, since th is th e su m of tw o indep en den t logco n cav e r andom v ariables, its densit y is also logconca v e. Let h 1 denote the dens it y of (1 − τ ) X and h 2 the densit y of √ 2 τ + τ 2 Z , then the densit y of our new random v a riable is giv en by: h 1 ∗ h 2 ( x ) = Z ∞ −∞ h 1 ( x − t ) h 2 ( t ) dt The con v olution of these t w o distributions is (inﬁnitely) diﬀerentia ble b ecause h 2 is (inﬁnitely) diﬀeren tiable: d dx ( h 1 ∗ ) =  d dx h 1  ∗ h 2 = h 1 ∗  d dx h 2  Th u s Y τ satisﬁes th e h yp otheses of Lemma 15, and we h a v e a tai l b ound f or Y τ as long as τ > 0. The ﬁr st m momen ts of Y are also close to those of X : if we compute the j th momen t for example: E  Y j τ  = E   (1 − τ ) X + p 2 τ + τ 2 Z  j  = (1 − τ ) j E  X j  + j X i =1  i j  (1 − τ ) j ( p 2 τ + τ 2 ) i − j E  X i  E  Z i − j  Th u s we can pick τ small enough so that:   E  Y j τ  − E  X j    ≤ ǫ 31 for a ny ǫ > 0. I n the proof of Lemma 13 then , in stead of the momen t diﬀerences fr om th e ﬁrst m terms of the c h aracteristic function b eing 0, w e can mak e them arbitrarily small b y c ho osing smaller τ . Thus w e hav e the conclusion of Lemma 13 for Z . T o conclude, w e note that: lim τ → 0 k h 1 ∗ h 2 − f k 2 = 0 in whic h case, taking τ small enough allo ws us to ap p ly the triangle inequalit y to: k f − g k ≤ k f − h k + k h − g k W e also need a lemma to con v ert our L 2 estimates to L 1 estimates. This is not general in p ossible, but since log conca ve fun ctions ha ve exp onen tial tailb ounds: Lemma 18 ( L 2 to L 1 ) . L et f , g : R → R isot r opic lo gc onc ave densities suc h that f or some m > 0 that: Z | f ( x ) − g ( x ) | 2 dx ≤ 1 m then: Z | f ( x ) − g ( x ) | dx ≤ c log( m ) √ m for some absolute c onstant c > 0 . Pr o of. Fix L = ( 1 c ) log ( m ), then as b efore: Z | f ( x ) − g ( x ) | dx = Z | x |≤ L | f ( x ) − g ( x ) | dx + Z | x | >L | f ( x ) − g ( x ) | dx W e can now use tail b oun d for logconca v e fu nctions o v er the tail [21], in particular, for isotropic logconca v e random v a r iables X i n R n , w e ha v e (for some ﬁxed absolute constants c, C > 0: Pr    k x k − √ n   ≥ t √ n  ≤ C exp  − cn 1 2 min( t, t 3 )  In one dimension, this sho ws that the in tegral of our tail is b ound ed b y C /m (after application of triangle inequalit y). No w inside the int erv al [ − L, L ], w e w ill apply the Cau ch y-Sc hw artz inequalit y: Z [ − L,L ] | f ( x ) − g ( x ) | dx ≤ Z [ − L,L ] | f ( x ) − g ( x ) | 2 dx ! 1 / 2 Z [ − L,L ] 1 dx ! 1 / 2 ≤ √ 2 c √ m log( m ) Pr o of of The or em 1. The p r o of follo ws from Lemma 14, Corollary 17 and Lemma 18, noting that the the tec h nique of Corollary 17 can b e applied to Lemma 14 in the same w a y as Lemma 32 6 Applications In this section, we giv e some applications of our general theorems and w e some explicit b oun ds for momen t-learnable triples and the running t ime of our algorithms on these triples. W e mak e explicit in our analysis the three k ey con tributions to runtime – ho w m an y moment s are required , ho w eﬃcient ly these momen ts can be sampled, and ho w eﬃcien tly the hypothesis can b e learned in the k -dimensional relev an t subsp ace. 6.1 Momen t estimation In this sectio n , w e highligh t some further consequences and su btleties of using moment s in algo- rithms. The use of m omen ts is a very natural wa y of studying ran d om v ariables. F or example, th e inequalities of Marko v, C hebyshev and Ch ernoﬀ are statement s abou t the relationship b et ween a ﬁnite sequence of momen ts and the tail of a distribution. If w e consider an inﬁnite sequence of momen ts, often these will determin e the distribu tion un iquely (the moments problem). One of the critical terms in the runt ime g iven in our main t h eorems is C F ( m, ǫ ): th e samp le complexit y of approxi m ating the m th momen t tensor of distribution F to within accuracy (in the momen t metric ab o ve ). Th e comp etitiv eness of our algorithm with other learning algorithms dep end s on th e n u m b er of momen ts we n eed (ie the previous section), and th e n u m b er of samples we need to attain the r equired accuracy . Th is latter problem is w ell-studied, and there is an impressive b o d y of literat u re surroun ding it. In particular, when m = 2, the p r oblem is of inte r est to r andom matrices communit y , who h a v e pro vided stron g b ound s in a n um b er of imp ortan t cases. W e will pro vide a b rief o verview of these results, but this b y no means is in tended to b e a comprehensive surve y of the literature! When the distribution F is isotropic and almost surely supp orted in a ball of radius O ( √ n ), Rudelson [35] ga v e a v ery strong b ound on C F ( n, ǫ ) to ac hiev e the follo wing guaran tee: E      1 N N X i =1 x i x T i − I      ! ≤ ǫ. Rudelson requir ed only O ( n log( n )) samples when F is almost surely supp orted on a b all of radius O ( √ n ), and where the constan t is dep en d en t on ǫ . Ad amczak et al. [1] were able to improv e this b ound of O ( n ) samples. Th eir assum p tions were supp ort on a ball of radiu s O ( √ n ) as b efore, and a sub exp onen tial momen t condition: sup k v k = 1 E  ( x T v ) p  1 /p = O ( p ) As an app lication, they show ed that logconca v e distributions satisfy these assumptions, and thus their co v ariance m atrices can b e sampled v ery eﬃcien tly . S u bsequent w ork b y V ershynin and collab- orators [39, 44] has br oadened the class of eﬃcien tly samp lable co v ariance matrices to d istributions where 2 + ǫ momen ts exist and also to d istributions where the m th momen t is b ounded by K m for some constan t K . Finally , in the setting of higher momen ts, there is the resu lt of Guedon and Rudelson [22], whic h giv es the sample complexit y of sampling for h igher momen ts of logc oncav e d istributions. Their resu lt is that O ( n m/ 2 log( n )) samples are necessary to appro ximate moment s in all d irections up to an 1 + ǫ f actor. In particular, this leads to the observ ation that exp licitly computing a sample moment tensor from n m/ 2 samples is actually less eﬃcien t than simply storing the p oints, 33 computing the inner pro du cts to the appropriate p o wers and summing. This last result is used in ou r applications in Section 6, as it allo ws us to handle man y distribu tions eﬃcien tly , i n cluding Gaussians and u niform distributions o v er conv ex b o dies. 6.2 Robust learning F or learning o v er a k -dimensional subspace, w e ha ve th e f ollo wing prop osition: Prop osition 9 (V C dimension) . L et H b e a hyp oth e si s class with VC dimension d . L et ℓ ∈ H b e a subsp ac e junta with r elevant subsp ac e V , wher e dim ( V ) = k . L et U b e a k d imensional sub sp ac e wher e ℓ ( π U ) lab els a 1 − ǫ fr action of p oints c orr e ctly. Then we c an le arn ℓ with sample c omplexity (1 /ǫ ) c 2 d log(1 /ǫ )+ c 2 log(2 /δ ) with pr ob ability at 1 − δ . Pr o of. T o come up with a hyp othesis o ve r U , we tak e a new s et of samples S of size m and pro ject them onto U . By robus tn ess of H und er F , w e know that Pr ( ℓ ( π U ( x )) = ℓ ( x )) ≥ 1 − ǫ . Then we guess the correct lab els b y trying all relab elings of subsets of size ǫm . One of these r elab elings will giv e us a lab eling consisten t with ℓ v iewed as a fun ction of the k -co ordin ates in U . F or eac h relab eling we attempt to learn the lab eling f unction. On the correct relab eling, w e can learn ℓ to with at most ǫ fraction of errors. By the theorem abov e, our to tal err or o v er R n is 2 ǫ . T o b ound m , w e apply an idea from [9] via a sligh t extension (Theorem 5 of [2]). The required b ound is m ≥ (32 /ǫ ) log ( C [ m ]) + (32 /ǫ ) log (2 /δ ) where C [ m ] is the maxim u m num b er of distinct lab elings obtainable u sing concepts in H o v er R k . In particular, we h a v e C [ m ] ≤ P d i =0  m i  , wh ence C [ m ] ≤ m d . A computation revea ls that m ≥ c ( d/ǫ ) log (1 /ǫ ) + ( c ′ /ǫ ) log (2 /δ ) suﬃces. Th e num b er of relab elings is  m ǫm  , whic h is upp er b ounded b y ( m/ǫ ) mǫ ≤ (1 /ǫ ) c 2 d log(1 /ǫ )+ c 2 log(2 /δ ) . As ment ioned previously , we can view the w ork of [42] as a sp ecializatio n o f our algorithms to the j = m = 2 case in FindBasis . W e giv e examples here wh ere the second m omen t do es not suﬃce, and we m ust use higher momen ts to resolv e the relev an t subspace V . Ou r examples are: (1) hyperr ectangles (cub oids) in balls, (2) sub sets of balls, and (3) concepts whic h ha ve compact supp ort. In all our examples, the algorithm used is LearnUnderGaussian . W e will pro ve that we can ﬁnd the relev an t sub spaces by run ning FindBasis on either the full distribution or distr ibution conditioned on p ositiv e lab els (the “p ositiv e” distribution). W e use the u niform distrib u tion o ver a b all in R k in the r elev a nt s u bspace. W e n eed the follo wing elemen tary fact. Claim 19 (Isotropic balls) . L et F b e the uniform d istribution (with density ρ ) over B R (0) ⊂ R n wher e R = √ n + 2 , then E  ( x T u ) 2  = 1 for any unit ve ctor u . By a h yp errectangle, we refer to a region of space w h ic h is the Cartesian pro duct of closed in terv als i.e. S = [ a i , b i ] × · · · × [ a k , b k ] ⊂ R k : Application 1 (Hyperr ectangles in balls) . L et F = F V F W wher e F V is a unif orm distribution over a b al l B and k = dim ( V ) , F W is any Gaussian over n − k dimensions. L et S ⊂ B denote a (hyp er)r e ctangle in V . T ake the hyp othesis class H = { ( χ S ( π V ))( x ) : S ⊂ B } to b e the set of functions which assigns p ositive lab els to p oints whose pr o je ction to V lies in the interior of r e ctangle S . Prop osition 10. The triple ( k, F , H ) as deﬁne d in A pplic atio n 1 is (4 , 6 / (5 k )) moment-le ar nable with time and sample c omplexity p oly( k , 1 /ǫ ) + C k ,ǫ n 2 . 34 Pr o of. Without loss of generalit y , w e ma y assu me th at B = B √ n +2 (0) after isotropic transformation, and that the Gaussian o ver F W is a standard n -dimensional Gaussian. F urtherm ore, we ma y assum e that S is cen tered on the origin as w ell (i.e. w e apply Lemma 2 to the p ositiv ely lab eled p oin ts). Supp ose w e no w run LearnUnderGaussian on the p ositive ly lab eled s amples. W e start with the seco n d momen t ( r = 2) in our alg orithm FindBasis : the second momen ts of a uniform distri- bution o v er a rectangle are fu lly d etermined b y the second momen ts along the axes of the r ectangle. In particular, FindBasis us ing the second momen ts will simply giv e us every axis of the rectangle where the second momen t is not 1. A simple calculati on of th e momen ts of a u n iform d istribution o v er a rectangle along axis x i where the recta n gle has length 2 S i giv es: E  x 2 i  = Z S i − S i x 2 i 1 2 S i dx i = S 2 i 3 . Th u s, using the second m omen t will give u s all the axes of our hyperr ectangle except wh ere th e rectangle has length 2 S i = 2 √ 3. Pro jecting orthogonally to these axes, we no w consider the third momen ts ( r = 3): the third moment of our uniform rectangle is clearly 0 in ev ery direction by symmetry of the r ectangle. Thus, w e tu rn to the fou r th moment – note that ﬁ xing S i = √ 3 ﬁxes the fourth m oment along eac h axis of the rectangle , in particular: E  x 4 i  = Z S i − S i x 4 i 1 2 S i dx i = 9 5 . Unfortunately , the equ alit y of the f ou r th moment along the axes of a rectangle do es n ot necessarily imply the same fourth moment in eve r y direction. Ho wev er, iterating Lemma 3 allo ws us to b ound the fourth m oments a w a y f r om the f ourth momen t of a Gaussian γ 4 = 3: E  ( x T u ) 4  =  9 5 − γ 4  X i ∈ R ′ u 4 i + γ 4 where the sum is take n o v er d irections corresp ondin g to axes where S i = √ 3. No w b y app lying the Lagrangian st yle tec hniques of Lemma 4, we can b oun d th is b y: E  ( x T u ) 4  ≤ γ 4 − 6 5 k Th u s, we ha v e our moment learnability using only the fourth moment! No w th at w e hav e the relev an t sub space V , we can simply learn our rectangle in a dimens ion k space, whic h tak es p oly( k ) time. Moreo v er, note th at since all the distributions are logc oncav e, w e can apply the m oment sampling resu lts of Guedeon and Rudelson mentioned in Section 6.1 – in particular, w e c an tak e the n umb er of samples required to b e C F ( m, ǫ ) = C ǫ n 2 . Thus this giv es a ﬁnal runtime of p oly( k ) + C k ,ǫ n 2 where C k ,ǫ . The k ey p oint here is that we h av e very lo w p olynomial d ep endence in n . T his conforms we ll with our mod el where we think of k as b eing small compared to n . W e can, in fact, pro ve a stronger result — we can alw ays ﬁnd th e r elev an t subspace if F V is a un iform d istribution o v er a ball: Application 2 (Uniform distributions o ver b alls) . L et F = F V F W wher e F V is a uniform distri- bution over a b al l B and k = dim ( V ) , F W is a Gaussian. L et H b e a r o bu st hyp oth esis class which we c an le arn with c omplexity b ounde d by T ( k , ǫ ) . 35 Prop osition 11. The triple ( k , F , H ) as deﬁne d in Applic ation 2 is (4 , Ω(1)) moment-le a rnable, with the time and sample c omplexity b ounde d by T ( k , ǫ ) + C k ,ǫ n 2 . Pr o of. W e w ill examine what h app ens when w e run FindBasis on the full distribution (as opp osed to the p ositiv e distr ib ution in the previous example). W e compute the fourth momen t of a ball of radius R = √ n + 2. F or simp licit y , we will assume that k = 2 l + 1 for some p ositiv e inte ger l ie k is odd : E  x 4 1  = Z B R (0) x 4 1 ρdx = Z R − R Z B k − 1 √ R 2 − x 2 1 (0) x 4 1 ρdx 2 · · · dx k dx 1 = 1 v ol  B k R (0)  Z R − R x 4 1 v ol  B k − 1 √ R 2 − x 2 1 (0)  dx 1 = v ol  B k − 1 R (0)  v ol  B k R (0)  Z R − R x 4 1  1 − x 2 1 R 2  l dx 1 W e ﬁrst exa m ine the v olume ratio: using the recurr ence: v ol  B k R (0)  = 2 π R 2 k v ol  B k − 2 R (0)  and unrolling the recurren ce, we hav e: v ol (2 l ) v ol (2 l + 1) = (2 l + 1)!! 2 R (2 l )!! = 1 2 R (2 l + 2)!! ( l + 1)!2 l +1 l !2 l = 1 2 R (2 l + 2)!! ( l + 1)! l !2 2 l +1 Applying Stirling’s appro ximation, w e ha ve: v ol (2 l ) v ol (2 l + 1) = 1 2 R p 2 π (2 l + 2) 2 π p l ( l + 1) 1 2 2 l +1  2 l + 2 e  2 l +2  e l  l  e l + 1  ( l +1) = 1 2 R 1 √ π l 2 e  l + 1 l  l ( l + 1) = 1 R √ π l + 1 √ l = 1 √ π 1 + s 1 l ( l + 2) ! Returning to th e integrand, w e can simplify it somewhat: Z R − R x 4 1  1 − x 2 1 R 2  l dx 1 = 2 Z R 0 x 4 1  1 − x 2 1 R 2  l dx 1 36 By explicitly taking the in tegral (u sing a compu ter algebra system), w e ha v e: Z R 0 x 4 1  1 − x 2 1 R 2  l dx 1 = 3 √ π (2 l + 3) 5 / 2 Γ( l + 1) 8Γ( l + 7 / 2 ) where Γ h ere is the usual gamma f unction. The b eha vior of this function is as follo ws: lim l →∞ 3 √ π (2 l + 3) 5 / 2 Γ( l + 1) 8Γ( l + 7 / 2) = 3 r π 2 Moreo v er, the fu nction is monotonic increasing for l > 0, and tak es on the v a lu e 56 √ 7 / 45 at l = 2. Th u s, com bining these facts with the estimate of the v olume r atios, we can see that the fourth momen t of a ball is b ounded a w a y from the fourth momen t of a standard Gaussian by a co n stan t, hence w e can tak e η = Ω(1). Once w e h a ve the relev an t s ubspace V , we can pr o ject the samples to V and learn in time T ( k , ǫ ). The ru n time in this case is T ( k , ǫ ) + C k ,ǫ n 2 . As a sp ecialization, when the p ositive examples are d etermin ed by a conv ex subset of the unit ball, T ( k , ǫ ) ≤ ( k /ǫ ) O ( k ) . In a k -dimen s ional subspace, w e can learn a con vex subset of the ball b y simply taking the conv ex hull of ( k /ǫ ) O ( k ) random p ositive p oin ts. F rom the classical appro ximation theory of conv ex b o d ies [34], we obtain an appro ximation to the true con vex b o dy to within relativ e error ǫ , giving total run time ( k /ǫ ) O ( k ) + C k ,ǫ n 2 . This complements [42] wh ic h pro vides a PCA- based algorithm for learnin g conv ex b o d ies w h en the distrib ution in the relev an t sub s pace is also Gaussian. In that p ap er, it is mentio ned that standard PC A fails if th e full d istr ibutions is not a Gaussian. W e n o w present an example that relies on b ound ed ness – either of the fu ll d istribution in the relev an t su b space, or the p ositiv e distribution. This rather general result uses relativ ely many momen ts. Application 3 (Compact distr ib ution in relev an t sub space) . L et F = F V F W wher e F W is any Gaussian over n − k dimensions. T ake H to b e a r obust hyp othesis c lass le arnable with c omplexity T ( k , ǫ ) . A ssume that either F V or H has i ts supp ort c ontaine d in B g ( k ) (0) . Prop osition 12. The triple ( k , F , H ) describ e d in Applic ation 3 is ( g ( k ) , Ω(1)) moment-le arnable with c omplexity T ( k , ǫ ) + C k ,ǫ n O ( g ( k ) 2 ) . Pr o of. S upp ose w e ru n FindBasis on the fu ll d istr ibution or th e p ositive distribution, w hic hever is con tained in a ball of radius g ( k ). Consider the relev an t subspace. If w e ﬁx some even momen t m then we can giv e exp licit bou n ds on the momen ts: E (( x t ) m ) ≤ g ( k ) m . On the other hand, the ev en moments of a Gaussian are giv en by ( m − 1)!! = m ! / ( m/ 2)!2 m/ 2 whic h gro ws m uc h more rapidly . If w e tak e lo garithms on b oth sides, th en w e can ﬁnd m = m ( k ) such that: m log ( g ( k )) ≤ log  m ! ( m/ 2)!2 m/ 2  37 Applying Stirling’s appro ximation yields: m 2 log( g ( k ) 2 ) ≤ m log ( m ) − m − m 2 log  m 2  + m 2 − m 2 log(2) ≤ m 2 log( m ) − m 2 So if w e pic k m = 2 g ( k ) 2 , then the diﬀerence in the momen ts should b e Ω (1). Th u s, simply runn in g FindBasis on the full distribu tion will allo w us to reco v er the relev an t subspace, at which p oin t we can learn H in R k (doable in time T ( k )). It remains to pro ve that w e can sample th e ﬁrst 2 g ( k ) 2 momen ts of a b ounded distribution eﬃcien tly: since it is b ounded , all momen ts exist. In p articular, if we requir e 2 g ( k ) 2 momen ts, then the 4 g ( k ) 2 momen t is b oun ded by g ( k ) 4 g ( k ) 2 . T h en by applying Chebyshev’s inequalit y , w e see that we need at most g ( k ) O ( g ( k ) 2 ) samples in the relev an t subs p ace. The o v erall r unt ime for this algorithm is then T ( k , ǫ ) + C k ,ǫ n O ( g ( k ) 2 ) . References [1] Radosla w Adamczak, Alexander Litv ak, Alain Pa jor, and Nicole T omczak-Jaege r m ann. Quan- titativ e estimates of the con v ergence of the empirical co v ariance matrix in logconca v e e n sem- bles. J . Amer. Math. So c. , 233:535–5 61, 2011. [2] Rosa I. Arriaga an d San tosh V empala. An a lgorithmic theory of learning: Robu s t concepts and random pr o jection. Machine L e arning , 63(2 ):161–182, 2006. [3] J W Barness, Y Carlin, and M L Stein b erger. Pro c. the glob ecom conference. pages 1251–1 255, 1982. [4] Eric B. Baum. On learning a u n ion of half spaces. J. Complexity , 6(1):67– 101, 1990. [5] An thony J. Bell and T errence J. S ejno wski. An in f ormation-maximizatio n approac h to blind separation and b lind decon v olution. Neur al Comput. , 7( 6):1129–11 59, No v ember 1995 . [6] Avrim Blum and Ra vi Kannan. Learning an intersectio n of k h alfsp aces o ve r a un iform dis- tribution. In F OCS , pages 31 2–320, 1993. [7] Avrim Blum and Ravindran Kannan. Learning an inte rs ection of a constan t n umber of h alfs- paces o v er a uniform distribution. J . Comput. Syst. Sci. , 54(2):3 71–380, 1997. [8] Avrim L. Blum. Rele v an t examples and relev an t f eatures: Thoughts from computational learning theory . In AAAI F al l Symp osium on ‘R elevanc e’ , 1994. [9] Anselm Blumer, Andrezj Ehr enfeuc ht, Da vid Haussler, and Manfr ed K. W armuth. Learnabilit y and the v arpnik-c h erv ov enkis dimension. Journal of the ACM , 6:929–965 , 19 89. [10] S . Ch arles Brubak er and Sant osh V empala. Extensions of Princip al Comp onent Analysis . PhD thesis, Georgia Institute of T ec hn ology , 2009. [11] S . Charles Brub ak er and Santosh V empala. Random tensors and p lan ted cliques. In RANDO M , pages 406 –419, 2009. [12] Anthon y Carb ery and James W righ t. Distributional and L q norm inequalities for p olynomials o v er con v ex b o d ies in R n . M athematic al R ese ar ch L etters , 8:233– 248, 2001. 38 [13] J .-F. Cardoso and B.H. Laheld. Equiv arian t adaptiv e source separation. Signal P r o c essing, IEEE T r ansactions on , 44(12) :3017 –3030, dec 1996. [14] J .F. Card oso. Sou r ce sep aration us ing higher order momen ts. In International Confer enc e on A c oustics, Sp e e ch, and Sig nal Pr o c essing , 1989. [15] P . Comon. Ind ep endent C omp onent Analysis. In Pr o c. Int. Sig. Pr o c. Workshop on Higher- Or der Statistics , pages 111–120, Chamrousse, F r ance, July 10-12 1991. Keynote address. Republished in Higher-Or der Statistics , J .L.Lacoume ed., E lsevier, 199 2, pp 29– 38. [16] P . Comon. Ind ep endent Comp onent Analysis, a new concept ? Signal Pr o c essing, E lsevier , 36(3): 287–314, April 1994 . Sp ecial issue on Higher-Ord er Statistics. hal-00417 283. [17] Nathalie Delfosse and Philipp e Loubaton. Adaptiv e blind separation of indep end en t sources: A d eﬂation approac h. Signal P r o c essing , 45(1):59 – 83, 199 5. [18] William F eller. An Intr o duction to Pr ob ability The ory and its Applic a tions, vol 1 . John Wiley and sons, 1968. [19] Alan F rieze, Mark Jerr um, and Ra vi Kannan. Learning linear transformations. In FOCS , pages 359 –368, 1996. [20] Alan F rieze and Ravi Kannan. A new app roac h to the plan ted clique problem. In Pr o c e e dings of FST and TCS , 2008. [21] O livier Gu edon and Eman uel Milman. In terp olating thin-sh ell and sharp large-deviation esti- mates for isotropic log-conca v e measures. Ge ometric F u nc tional Analysis , to app ear, 2011. [22] O livier Guedon and Mark Ru delson. L p momen ts of rand om vect ors via ma jorizing measures. A dvanc e s in Mathematics , 208:7 98–823, 2007. [23] A. Hyv arinen. F ast and robust ﬁxed-p oin t algorithms for indep end en t comp onent a n alysis. Neur al Networks, IE EE T r ansactions on , 10(3): 626 –634, ma y 1999 . [24] Aap o Hyv arinen, Juha Karhunen, and E r kki Oja. Indep endent Comp onent Analysis . John Wiley and Son s , 20 01. [25] C hristian Ju tten and Jeanny Herault. Blind separation of sources, part i: An adaptive algo- rithm based on neur omimetic arc hitecture. Signal Pr o c essing , 24(1):1 – 10, 19 91. [26] Ad am R. K liv ans, Philip M. L ong, and Alex K. T ang. Baum’s algorithm learns int ers ections of halfspaces with resp ect to log-conca v e d istributions. In APPRO X-RAN DOM , p ages 588–6 00, 2009. [27] Ad am R. Kliv ans, Ry an O ’Donnell, and Ro cco A. Serv edio. Learning geome tric concepts via gaussian surface area. In FOCS , pages 541– 550, 2008. [28] T a mara Kolda. Shifted p o wer for computing tensor eigenpairs. arXiv:1007.126 7v2, F ebruary 2011. [29] T a mara G. Kolda and Brett W. Bader. T ensor decomp ositions and applicatio n s. SIAM R ev. , 51(3): 455–500, August 2009 . 39 [30] J ean-Louis Lacoume and P . Ruiz. Separation of indep end en t sour ces fr om correlated inputs. IEEE T r ansactions on Signal Pr o c essing , 40(12 ):3074–307 8 , 1992. [31] L´ aszl´ o Lo v´ asz an d San tosh V empala. The geometry of logconca v e functions and s ampling algorithms. R a ndom Struct. Algorith ms , 30(3):307– 358, 2 007. [32] Elchanan Mossel, Ryan O’Donnell, and Ro cco A. Servedio. Learning functions of k r elev an t v ariables. Journal of Computer and System Scienc es , 69:4 21–434, 2004 . [33] Ph ong Q. Nguy en and O d ed Regev. Learning a parallelepip ed: Cryptanalysis of ggh and n tru signatures. In E UROCR YPT , 2006. [34] J . M. Wills (editors) P . M. Grub er. Handb o o k of c onvex ge ometry. V ol. A. B . North-Holland, Amsterdam, 199 3. [35] Mark Rudelson. Random vecto r s in the isotropic p osition. J. of F unc tional Analysis , 164:60–72, 1999. [36] J ac k Sc hw artz. F ast probabilistic algorithms for v eriﬁcation of p olynomial iden tities. Journal of the ACM , 27:701 –717, 1980. [37] O S halvi and E W einstein. New criteria for blind decon v olution of nonminimum phase systems (c hannels). IEEE T r ansactions on Information The o ry , 36(2):312–3 21. [38] Alb ert Shiry aev. Pr ob ability . S pringer-V erlag, New Y ork, NY, 1995. [39] Nikh il Sriv asta v a and Roman V ershynin. Cov ariance estimates for distribu tions with 2 + ǫ momen ts. submitte d , 2011. [40] L. G. V alian t. A theory of the learnable. Commun. ACM , 27(11 ):1134–114 2 , No vem b er 1984. [41] S an tosh V empala. A r andom sampling b ased algorithm for learning the intersectio n of h alf- spaces. In F OCS , pages 50 8–513, 1997. [42] S an tosh V empala. Learning con ve x concepts from gaussian distribu tions with p ca. In FOCS , pages 541 –550, 2010. [43] S an tosh V empala. A r andom sampling b ased algorithm for learning the intersectio n of h alf- spaces. JACM , 57:32 :1–32:14, 2010. [44] Roman V ershynin. Ho w close is the samp le co v ariance to the actual co v ariance matrix? Journal of The or etic al Pr ob ability , to app ear, 201 0. 40

Structure from Local Optima: Learning Subspace Juntas via Higher Order PCA

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment