Fast approximation of matrix coherence and statistical leverage


Authors: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff

Abstract

The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors, and the coherence is the largest leverage score. These quantities are of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation, as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(nd \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments.
1 Introduction

The concept of statistical leverage measures the extent to which the singular vectors of a matrix are correlated with the standard basis, and as such it has found usefulness recently in large-scale data analysis and in the analysis of randomized matrix algorithms [47, 33, 21]. A related notion is that of matrix coherence, which has been of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation [13, 46]. Defined more precisely below, the statistical leverage scores may be computed as the squared Euclidean norms of the rows of the matrix containing the top left singular vectors, and the coherence of the matrix is the largest statistical leverage score. Statistical leverage scores have a long history in statistical data analysis, where they have been used for outlier detection in regression diagnostics [29, 14]. Statistical leverage scores have also proved crucial recently in the development of improved worst-case randomized matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists [21, 33, 12, 20, 45, 22]; see [32] for a detailed discussion. The naïve and best previously existing algorithm to compute these scores would compute an orthogonal basis for the dominant part of the spectrum of $A$, e.g., the basis provided by the Singular Value Decomposition (SVD) or a basis provided by a QR decomposition [26], and then use that basis to compute the diagonal elements of the projection matrix onto the span of that basis.

Author affiliations: Petros Drineas (Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180; drinep@cs.rpi.edu), Malik Magdon-Ismail (Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180; magdon@cs.rpi.edu), Michael W. Mahoney (Dept. of Mathematics, Stanford University, Stanford, CA 94305; mmahoney@cs.stanford.edu), David P. Woodruff (IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA; dpwoodru@us.ibm.com).

We present a randomized algorithm to compute relative-error approximations to every statistical leverage score in time qualitatively faster than the time required to compute an orthogonal basis. For the case of an arbitrary $n \times d$ matrix $A$, with $n \gg d$, our main algorithm runs (under assumptions on the precise values of $n$ and $d$; see Theorem 1 for an exact statement) in $O(nd \log n / \epsilon^2)$ time, as opposed to the $\Theta(nd^2)$ time required by the naïve algorithm. As a corollary, our algorithm provides a relative-error approximation to the coherence of an arbitrary matrix in the same time. In addition, several practically important extensions of the basic idea underlying our main algorithm are also described in this paper.

1.1 Overview and definitions

We start with the following definition of the statistical leverage scores of a matrix.

Definition 1. Given an arbitrary $n \times d$ matrix $A$, with $n > d$, let $U$ denote the $n \times d$ matrix consisting of the $d$ left singular vectors of $A$, and let $U_{(i)}$ denote the $i$-th row of the matrix $U$ as a row vector. Then, the statistical leverage scores of the rows of $A$ are given by

    $\ell_i = \|U_{(i)}\|_2^2$,    (1)

for $i \in \{1, \ldots, n\}$; the coherence $\gamma$ of the rows of $A$ is

    $\gamma = \max_{i \in \{1, \ldots, n\}} \ell_i$,    (2)

i.e., it is the largest statistical leverage score of $A$; and the $(i,j)$-cross-leverage scores $c_{ij}$ are

    $c_{ij} = \langle U_{(i)}, U_{(j)} \rangle$,    (3)

i.e., they are the dot products between the $i$th row and the $j$th row of $U$.

Although we have defined these quantities in terms of a particular basis, they clearly do not depend on that particular basis, but only on the space spanned by that basis. To see this, let $P_A$ denote the projection matrix onto the span of the columns of $A$.
Then,

    $\ell_i = \|U_{(i)}\|_2^2 = (UU^T)_{ii} = (P_A)_{ii}$.    (4)

That is, the statistical leverage scores of a matrix $A$ are equal to the diagonal elements of the projection matrix onto the span of its columns.¹ Similarly, the $(i,j)$-cross-leverage scores are equal to the off-diagonal elements of this projection matrix, i.e.,

    $c_{ij} = (P_A)_{ij} = \langle U_{(i)}, U_{(j)} \rangle$.    (5)

Clearly, $O(nd^2)$ time suffices to compute all the statistical leverage scores exactly: simply perform the SVD or compute a QR decomposition of $A$ in order to obtain any orthogonal basis for the range of $A$, and then compute the Euclidean norms of the rows of the resulting matrix. Thus, in this paper, we are interested in algorithms that run in $o(nd^2)$ time.

Several additional comments are worth making regarding this definition. First, since $\sum_{i=1}^{n} \ell_i = \|U\|_F^2 = d$, we can define a probability distribution over the rows of $A$ as $p_i = \ell_i / d$. As discussed below, these probabilities have played an important role in recent work on randomized matrix algorithms, and an important algorithmic question is the degree to which they are uniform or nonuniform.² Second, one could also define leverage scores for the columns of a "tall" matrix $A$, but clearly those are all equal to one unless $n < d$ or $A$ is rank-deficient.

¹ In this paper, for simplicity of exposition, we consider the case that the matrix $A$ has rank equal to $d$, i.e., has full column rank. Theoretically, the extension to rank-deficient matrices $A$ is straightforward: simply modify Definition 1, and thus Eqns. (4) and (5), to let $U$ be any orthogonal matrix (clearly, with fewer than $d$ columns) spanning the column space of $A$. From a numerical perspective, things are substantially more subtle, and we leave this for future work that considers numerical implementations of our algorithms.
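As a concrete illustration of the exact $O(nd^2)$ baseline and the remarks above, the following is a minimal NumPy sketch (not the paper's fast algorithm): compute an orthonormal basis for the range of $A$ via a thin QR, take squared row norms to get the $\ell_i$, and read off the coherence $\gamma$ and the sampling probabilities $p_i = \ell_i/d$. The function name is ours, introduced for illustration.

```python
import numpy as np

def exact_leverage_scores(A):
    """Naive O(n d^2) computation: squared row norms of an
    orthonormal basis for the range of A (here via thin QR)."""
    Q, _ = np.linalg.qr(A)          # Q is n x d with orthonormal columns
    return (Q * Q).sum(axis=1)      # ell_i = ||Q_(i)||_2^2

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))

ell = exact_leverage_scores(A)
coherence = ell.max()               # gamma = max_i ell_i
p = ell / A.shape[1]                # sampling distribution p_i = ell_i / d

# Sanity checks from the text: the scores sum to d, and they do not
# depend on the particular orthonormal basis chosen (QR vs. SVD).
U, _, _ = np.linalg.svd(A, full_matrices=False)
assert np.isclose(ell.sum(), A.shape[1])
assert np.allclose(ell, (U * U).sum(axis=1))
```

The basis-independence check is exactly the observation around Eqn. (4): both $Q$ and $U$ span the range of $A$, so the diagonal of the projector, and hence the scores, coincide.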
Third, and more generally, given a rank parameter $k$, one can define the statistical leverage scores relative to the best rank-$k$ approximation to $A$ to be the $n$ diagonal elements of the projection matrix onto the span of $A_k$, the best rank-$k$ approximation to $A$.

1.2 Our main result

Our main result is a randomized algorithm for computing relative-error approximations to every statistical leverage score, as well as an additive-error approximation to all of the large cross-leverage scores, of an arbitrary $n \times d$ matrix, with $n \gg d$, in time qualitatively faster than the time required to compute an orthogonal basis for the range of that matrix. Our main algorithm for computing approximations to the statistical leverage scores (see Algorithm 1 in Section 3) will amount to constructing a "randomized sketch" of the input matrix and then computing the Euclidean norms of the rows of that sketch. This sketch can also be used to compute approximations to the large cross-leverage scores (see Algorithm 2 of Section 3). The following theorem provides our main quality-of-approximation and running time result for Algorithm 1.

Theorem 1. Let $A$ be a full-rank $n \times d$ matrix, with $n \gg d$; let $\epsilon \in (0, 1/2]$ be an error parameter; and recall the definition of the statistical leverage scores $\ell_i$ from Definition 1. Then, there exists a randomized algorithm (Algorithm 1 of Section 3 below) that returns values $\tilde{\ell}_i$, for all $i \in \{1, \ldots, n\}$, such that with probability at least 0.8,

    $|\ell_i - \tilde{\ell}_i| \le \epsilon \ell_i$    (6)

holds for all $i \in \{1, \ldots, n\}$. Assuming $d \le n \le e^d$, the running time of the algorithm is

    $O\left( nd \ln(d\epsilon^{-1}) + nd\epsilon^{-2} \ln n + d^3 \epsilon^{-2} (\ln n) \ln(d\epsilon^{-1}) \right)$.
Algorithm 1 provides a relative-error approximation to all of the statistical leverage scores $\ell_i$ of $A$ and, assuming $d \ln d = o(n / \ln n)$ and $\ln n = o(d)$, and treating $\epsilon$ as a constant, its running time is $o(nd^2)$, as desired. As a corollary, the largest leverage score (and thus the coherence) is approximated to relative error in $o(nd^2)$ time. The following theorem provides our main quality-of-approximation and running time result for Algorithm 2.

Theorem 2. Let $A$ be a full-rank $n \times d$ matrix, with $n \gg d$; let $\epsilon \in (0, 1/2]$ be an error parameter; let $\kappa$ be a parameter; and recall the definition of the cross-leverage scores $c_{ij}$ from Definition 1. Then, there exists a randomized algorithm (Algorithm 2 of Section 3 below) that returns the pairs $\{(i,j)\}$ together with estimates $\{\tilde{c}_{ij}\}$ such that, with probability at least 0.8,

i. If $c_{ij}^2 \ge \frac{d}{\kappa} + 12\epsilon \ell_i \ell_j$, then $(i,j)$ is returned; if $(i,j)$ is returned, then $c_{ij}^2 \ge \frac{d}{\kappa} - 30\epsilon \ell_i \ell_j$.

ii. For all pairs $(i,j)$ that are returned, $\tilde{c}_{ij}^2 - 30\epsilon \ell_i \ell_j \le c_{ij}^2 \le \tilde{c}_{ij}^2 + 12\epsilon \ell_i \ell_j$.

This algorithm runs in $O(\epsilon^{-2} n \ln n + \epsilon^{-3} \kappa d \ln^2 n)$ time.

Note that by setting $\kappa = n \ln n$, we can compute all the "large" cross-leverage scores, i.e., those satisfying $c_{ij}^2 \ge \frac{d}{n \ln n}$, to within additive error in $O(nd \ln^3 n)$ time (treating $\epsilon$ as a constant). If $\ln^3 n = o(d)$, the overall running time is $o(nd^2)$, as desired.

² Observe that if $U$ consists of $d$ columns from the identity, then the leverage scores are extremely nonuniform: $d$ of them are equal to one and the remainder are equal to zero. On the other hand, if $U$ consists of $d$ columns from a normalized Hadamard transform (see Section 2.3 for a definition), then the leverage scores are very uniform: all $n$ of them are equal to $d/n$.
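For intuition about the objects in Theorem 2, the following sketch computes the exact cross-leverage scores by brute force and selects the "large" pairs with $c_{ij}^2 \ge d/\kappa$. This reference implementation costs $O(n^2 d)$ time and $O(n^2)$ space, so it is only feasible for small $n$; Algorithm 2's contribution is to find such pairs without forming all $n^2$ dot products. The helper name and the planted-outlier construction are ours, for illustration only.

```python
import numpy as np

def large_cross_leverage_pairs(A, kappa):
    """Brute-force reference: all (i, j) with i < j and c_ij^2 >= d / kappa,
    where c_ij = <U_(i), U_(j)> and U holds the left singular vectors of A."""
    n, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    C = U @ U.T                              # all cross-leverage scores (n x n)
    iu, ju = np.triu_indices(n, k=1)         # distinct pairs i < j
    mask = C[iu, ju] ** 2 >= d / kappa
    return list(zip(iu[mask], ju[mask], C[iu, ju][mask]))

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 5))
A[:, 4] = 0.0                  # confine the bulk rows to a 4-dimensional subspace
A[7] = A[42] = np.eye(5)[4]    # two identical "outlier" rows along e_5

# By construction the fifth left singular vector is (e_7 + e_42)/sqrt(2),
# so c_{7,42} = 1/2 while all other cross-leverages stay tiny.
pairs = large_cross_leverage_pairs(A, kappa=A.shape[0])
print(pairs)
```

With $\kappa = n$ the threshold is $d/n = 0.05$, so only the planted pair $(7, 42)$ with $c_{7,42}^2 = 0.25$ is returned.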
1.3 Significance and related work

Our results are important for their applications to fast randomized matrix algorithms, as well as their applications in numerical linear algebra and large-scale data analysis more generally.

Significance in theoretical computer science. The statistical leverage scores define the key structural nonuniformity that must be dealt with (i.e., either rapidly approximated or rapidly uniformized at the preprocessing step) in developing fast randomized algorithms for matrix problems such as least-squares regression [45, 22] and low-rank matrix approximation [39, 45, 21, 33, 12]. Roughly, the best random sampling algorithms use these scores (or the generalized leverage scores relative to the best rank-$k$ approximation to $A$) as an importance sampling distribution to sample with respect to. On the other hand, the best random projection algorithms rotate to a basis where these scores are approximately uniform and thus in which uniform sampling is appropriate. See [32] for a detailed discussion.

As an example, the CUR decomposition of [21, 33] essentially computes $p_i = \ell_i / k$, for all $i \in \{1, \ldots, n\}$ and for a rank parameter $k$, and it uses these as an importance sampling distribution. The computational bottleneck for these and related random sampling algorithms is the computation of the importance sampling probabilities. On the other hand, the computational bottleneck for random projection algorithms is the application of the random projection, which is sped up by using variants of the Fast Johnson-Lindenstrauss Transform [2, 3]. By our main result, the leverage scores (and thus these probabilities) can be approximated in time that depends on an application of a Fast Johnson-Lindenstrauss Transform.
In particular, the random sampling algorithms of [20, 21, 33] for least-squares approximation and low-rank matrix approximation now run in time that is essentially the same as the best corresponding random projection algorithm for those problems [45].

Applications to numerical linear algebra. Recently, high-quality numerical implementations of variants of the basic randomized matrix algorithms have proven superior to traditional deterministic algorithms [44, 43, 6]. An important question raised by our main results is how these will compare with an implementation of our main algorithm. More generally, density functional theory [8] and uncertainty quantification [7] are two scientific computing areas where computing the diagonal elements of functions (such as a projection or inverse) of very large input matrices is common. For example, in the former case, "heuristic" methods based on using Chebyshev polynomials have been used in numerical linear algebra to compute the diagonal elements of the projector [8]. Our main algorithm should have implications in both of these areas.

Applications in large-scale data analysis. The statistical leverage scores and the scores relative to the best rank-$k$ approximation to $A$ are equal to the diagonal elements of the so-called "hat matrix" [29, 15]. As such, they have a natural statistical interpretation in terms of the "leverage" or "influence" associated with each of the data points [29, 14, 15]. In the context of regression problems, the $i$th leverage score quantifies the leverage or influence of the $i$th constraint/row of $A$ on the solution of the overconstrained least-squares optimization problem $\min_x \|Ax - b\|_2$, and the $(i,j)$-th cross-leverage score quantifies how much influence or leverage the $i$th data point has on the $j$th least-squares fit (see [29, 14, 15] for details).
When applied to low-rank matrix approximation problems, the leverage score $\ell_j$ quantifies the amount of leverage or influence exerted by the $j$th column of $A$ on its optimal low-rank approximation. Historically, these quantities have been widely used for outlier identification in diagnostic regression analysis [47, 16]. More recently, these scores (usually the largest scores) often have an interpretation in terms of the data and processes generating the data that can be exploited. For example, depending on the setting, they can have an interpretation in terms of high-degree nodes in data graphs, very small clusters in noisy data, coherence of information, articulation points between clusters, the value of a customer in a network, space localization in sensor networks, etc. [9, 42, 38, 30, 32]. In genetics, dense matrices of size thousands by hundreds of thousands (a size scale at which even traditional deterministic QR algorithms fail to run) constructed from DNA Single Nucleotide Polymorphism (SNP) data are increasingly common, and the statistical leverage scores can correlate strongly with other metrics of genetic interest [41, 33, 19, 40]. Our main result will permit the computation of these scores and related quantities for significantly larger SNP data sets than has been possible previously [41, 19, 40, 24].

Remark. Lest there be any confusion, we should emphasize our main contributions. First, note that statistical leverage and matrix coherence are important concepts in statistics and machine learning. Second, recall that several random sampling algorithms for ubiquitous matrix problems such as least-squares approximation and low-rank matrix approximation use leverage scores in a crucial manner; but until now these algorithms were $\Omega(T_{SVD})$, where $T_{SVD}$ is the time required to compute a QR decomposition or a partial SVD of the input matrix.
Third, note that, in some cases, $o(T_{SVD})$ algorithms exist for these problems based on fast random projections. But recall that the existence of those projection algorithms in no way implies that it is easy or obvious how to compute the statistical leverage scores efficiently. Fourth, one implication of our main result is that those random sampling algorithms can now be performed just as efficiently as those random projection algorithms; thus, the solution for those matrix problems can now be obtained while preserving the identity of the rows. That is, these problems can now be solved just as efficiently by using actual rows, rather than the arbitrary linear combinations of rows that are returned by random projections. Fifth, we provide a generalization to "fat" matrices and to obtaining the cross-leverage scores. Sixth, we develop algorithms that can compute leverage scores and related statistics even in streaming environments.

1.4 Empirical discussion of our algorithms

Although the main contribution of our paper is to provide a rigorous theoretical understanding of fast leverage score approximation, our paper does analyze the theoretical performance of what is meant to be a practical algorithm. Thus, one might wonder about the empirical performance of our algorithms; for example, whether hidden constants render the algorithms useless for data of realistic size. Not surprisingly, this depends heavily on the quality of the numerical implementation, whether one is interested in "tall" or more general matrices, etc. We will consider empirical and numerical aspects of these algorithms in forthcoming papers, e.g., [25]. We will, however, provide here a brief summary of several numerical issues for the reader interested in these issues.
Empirically, the running time bottleneck for our main algorithm (Algorithm 1 of Section 3) applied to "tall" matrices is the application of the random projection $\Pi_1$. Thus, empirically the running time is similar to the running time of random projection based methods for computing approximations to the least-squares problem, which is also dominated by the application of the random projection. The state of the art here is the Blendenpik algorithm of [6] and the LSRN algorithm of [34]. In their Blendenpik paper, Avron, Maymounkov, and Toledo showed that their high-quality numerical implementation of a Hadamard-based random projection (and associated least-squares computation) "beats Lapack's³ direct dense least-squares solver by a large margin on essentially any dense tall matrix," and they concluded that their empirical results "suggest that random projection algorithms should be incorporated into future versions of Lapack" [6]. The LSRN algorithm of Meng, Saunders, and Mahoney improves Blendenpik in several respects, e.g., providing better handling of sparsity and rank deficiency, but most notably the random projection underlying LSRN is particularly appropriate for solving large problems on clusters with high communication cost; e.g., it has been shown to scale well on Amazon Elastic Cloud Compute clusters. Thus, our main algorithm should extend easily to these environments with the use of the random projection underlying LSRN.

³ Lapack (short for Linear Algebra PACKage) is a high-quality and widely used software library of numerical routines for solving a wide range of numerical linear algebra problems.
Moreover, for both Blendenpik and LSRN (when implemented with a Hadamard-based random projection), the hidden constants in the Hadamard-based random projection are so small that the random projection algorithm (and thus the empirical running time of our main algorithm for approximating leverage scores) beats the traditional $O(nd^2)$ time algorithm for dense matrices as small as thousands of rows by hundreds of columns.

1.5 Outline

In Section 2, we will provide a brief review of relevant notation and concepts from linear algebra. Then, in Sections 3 and 4, we will present our main results: Section 3 will contain our main algorithm and Section 4 will contain the proof of our main theorem. Section 5 will then describe extensions of our main result to general "fat" matrices, i.e., those with $n \approx d$. Section 6 will conclude by describing the relationship of our main result with another related estimator for the statistical leverage scores, an application of our main algorithm to the under-constrained least-squares approximation problem, and extensions of our main algorithm to streaming environments.

2 Preliminaries on linear algebra and fast random projections

2.1 Basic linear algebra and notation

Let $[n]$ denote the set of integers $\{1, 2, \ldots, n\}$. For any matrix $A \in \mathbb{R}^{n \times d}$, let $A_{(i)}$, $i \in [n]$, denote the $i$-th row of $A$ as a row vector, and let $A^{(j)}$, $j \in [d]$, denote the $j$-th column of $A$ as a column vector. Let $\|A\|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{d} A_{ij}^2$ denote the square of the Frobenius norm of $A$, and let $\|A\|_2 = \sup_{\|x\|_2 = 1} \|Ax\|_2$ denote the spectral norm of $A$. Relatedly, for any vector $x \in \mathbb{R}^n$, its Euclidean norm (or $\ell_2$-norm) is the square root of the sum of the squares of its elements. The dot product between two vectors $x, y \in \mathbb{R}^n$ will be denoted $\langle x, y \rangle$, or alternatively as $x^T y$.
Finally, let $e_i \in \mathbb{R}^n$, for all $i \in [n]$, denote the standard basis vectors for $\mathbb{R}^n$, and let $I_n$ denote the $n \times n$ identity matrix.

Let the rank of $A$ be $\rho \le \min\{n, d\}$, in which case the "compact" or "thin" SVD of $A$ is denoted by $A = U \Sigma V^T$, where $U \in \mathbb{R}^{n \times \rho}$, $\Sigma \in \mathbb{R}^{\rho \times \rho}$, and $V \in \mathbb{R}^{d \times \rho}$. (For a general matrix $X$, we will write $X = U_X \Sigma_X V_X^T$.) Let $\sigma_i(A)$, $i \in [\rho]$, denote the $i$-th singular value of $A$, and let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote the maximum and minimum singular values of $A$, respectively. The Moore-Penrose pseudoinverse of $A$ is the $d \times n$ matrix defined by $A^{\dagger} = V \Sigma^{-1} U^T$ [37]. Finally, for any orthogonal matrix $U \in \mathbb{R}^{n \times \ell}$, let $U^{\perp} \in \mathbb{R}^{n \times (n - \ell)}$ denote an orthogonal matrix whose columns are an orthonormal basis spanning the subspace of $\mathbb{R}^n$ that is orthogonal to the subspace spanned by the columns of $U$ (i.e., the range of $U$). It is always possible to extend an orthogonal matrix $U$ to a full orthonormal basis of $\mathbb{R}^n$ as $[U\ U^{\perp}]$.

The SVD is important for a number of reasons [26]. For example, the projection of the columns of $A$ onto the $k$ left singular vectors associated with the top $k$ singular values gives the best rank-$k$ approximation to $A$ in the spectral and Frobenius norms. Relatedly, the solution to the least-squares (LS) approximation problem is provided by the SVD: given an $n \times d$ matrix $A$ and an $n$-vector $b$, the LS problem is to compute the minimum $\ell_2$-norm vector $x$ such that $\|Ax - b\|_2$ is minimized over all vectors $x \in \mathbb{R}^d$. This optimal vector is given by $x_{opt} = A^{\dagger} b$. We call a LS problem overconstrained (or overdetermined) if $n > d$ and underconstrained (or underdetermined) if $n < d$.

2.2 The Fast Johnson-Lindenstrauss Transform (FJLT)

Given $\epsilon > 0$ and a set of points $x_1, \ldots, x_n$ with $x_i \in \mathbb{R}^d$, an $\epsilon$-Johnson-Lindenstrauss Transform ($\epsilon$-JLT), denoted $\Pi \in \mathbb{R}^{r \times d}$, is a projection of the points into $\mathbb{R}^r$ such that

    $(1 - \epsilon) \|x_i\|_2^2 \le \|\Pi x_i\|_2^2 \le (1 + \epsilon) \|x_i\|_2^2$.    (7)

To construct an $\epsilon$-JLT with high probability, simply choose every entry of $\Pi$ independently, equal to $\pm\sqrt{3/r}$ with probability $1/6$ each and zero otherwise (with probability $2/3$) [1]. Let $\Pi_{JLT}$ be a matrix drawn from such a distribution over $r \times d$ matrices.⁴ Then, the following lemma holds.

Lemma 1 (Theorem 1.1 of [1]). Let $x_1, \ldots, x_n$ be an arbitrary (but fixed) set of points, where $x_i \in \mathbb{R}^d$, and let $0 < \epsilon \le 1/2$ be an accuracy parameter. If

    $r \ge \frac{1}{\epsilon^2} \left( 12 \ln n + 6 \ln \frac{1}{\delta} \right)$,

then, with probability at least $1 - \delta$, $\Pi_{JLT} \in \mathbb{R}^{r \times d}$ is an $\epsilon$-JLT.

⁴ When no confusion can arise, we will use $\Pi_{JLT}$ to refer to this distribution over matrices as well as to a specific matrix drawn from this distribution.

For our main results, we will also need a stronger requirement than the simple $\epsilon$-JLT, and so we will use a version of the Fast Johnson-Lindenstrauss Transform (FJLT), which was originally introduced in [2, 3]. Consider an orthogonal matrix $U \in \mathbb{R}^{n \times d}$, viewed as $d$ vectors in $\mathbb{R}^n$. An FJLT projects the vectors from $\mathbb{R}^n$ to $\mathbb{R}^r$, while preserving the orthogonality of $U$; moreover, it does so very quickly. Specifically, given $\epsilon > 0$, $\Pi \in \mathbb{R}^{r \times n}$ is an $\epsilon$-FJLT for $U$ if

• $\|I_d - U^T \Pi^T \Pi U\|_2 \le \epsilon$, and

• given any $X \in \mathbb{R}^{n \times d}$, the matrix product $\Pi X$ can be computed in $O(nd \ln r)$ time.

The next lemma follows from the definition of an $\epsilon$-FJLT, and its proof can be found in [20, 22].

Lemma 2. Let $A$ be any matrix in $\mathbb{R}^{n \times d}$ with $n \gg d$ and $\text{rank}(A) = d$. Let the SVD of $A$ be $A = U \Sigma V^T$, let $\Pi$ be an $\epsilon$-FJLT for $U$ (with $0 < \epsilon \le 1/2$), and let $\Psi = \Pi U = U_{\Psi} \Sigma_{\Psi} V_{\Psi}^T$. Then, all the following hold:

    $\text{rank}(\Pi A) = \text{rank}(\Pi U) = \text{rank}(U) = \text{rank}(A) = d$,    (8)

    $\|I - \Sigma_{\Psi}^{-2}\|_2 \le \epsilon / (1 - \epsilon)$, and    (9)

    $(\Pi A)^{\dagger} = V \Sigma^{-1} (\Pi U)^{\dagger}$.    (10)

2.3 The Subsampled Randomized Hadamard Transform (SRHT)

One can use a Randomized Hadamard Transform (RHT) to construct, with high probability, an $\epsilon$-FJLT. Our main algorithm will use this efficient construction in a crucial way.⁵ Recall that the (unnormalized) $n \times n$ matrix of the Hadamard transform $\hat{H}_n$ is defined recursively by

    $\hat{H}_{2n} = \begin{pmatrix} \hat{H}_n & \hat{H}_n \\ \hat{H}_n & -\hat{H}_n \end{pmatrix}$, with $\hat{H}_1 = 1$.

The $n \times n$ normalized matrix of the Hadamard transform is equal to $H_n = \hat{H}_n / \sqrt{n}$. From now on, for simplicity and without loss of generality, we assume that $n$ is a power of 2, and we will suppress $n$ and just write $H$. (Variants of this basic construction that relax this assumption and that are more appropriate for numerical implementation have been described and evaluated in [6].) Let $D \in \mathbb{R}^{n \times n}$ be a random diagonal matrix with independent diagonal entries $D_{ii} = +1$ with probability $1/2$ and $D_{ii} = -1$ with probability $1/2$. The product $HD$ is a RHT, and it has three useful properties. First, when applied to a vector, it "spreads out" its energy. Second, computing the product $HDx$ for any vector $x \in \mathbb{R}^n$ takes $O(n \log_2 n)$ time. Third, if we only need to access $r$ elements in the transformed vector, then those $r$ elements can be computed in $O(n \log_2 r)$ time [4]. The Subsampled Randomized Hadamard Transform (SRHT) randomly samples (according to the uniform distribution) a set of $r$ rows of a RHT.

⁵ Note that the RHT has also been crucial in the development of $o(nd^2)$ randomized algorithms for the general overconstrained LS problem [22], and its variants have been used to provide high-quality numerical implementations of such randomized algorithms [44, 6].
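As an illustration of this construction, the following NumPy sketch applies an SRHT to a matrix: an in-place fast Walsh-Hadamard transform gives the $O(n \log_2 n)$ application of $H$, $D$ is a random sign diagonal, and $r$ rows are sampled uniformly. The $\sqrt{n/r}$ rescaling folded into the sampling step is the standard normalization (in the paper it is absorbed into the sampling operator $S^T$); the function names are ours. This is a sketch for intuition, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import hadamard  # dense reference transform, for checking

def fwht(x):
    """Fast Walsh-Hadamard transform of each column of x: applies the
    unnormalized H_n in O(n log n) ops per column (n must be a power of 2)."""
    x = x.copy()
    n = x.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h]
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def srht(A, r, rng):
    """Apply Pi = sqrt(n/r) * S^T H D to A: random sign flips (D), the
    normalized Hadamard transform (H), then uniform sampling of r rows (S^T)."""
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)
    HDA = fwht(D[:, None] * A) / np.sqrt(n)      # normalized H applied to D A
    rows = rng.choice(n, size=r, replace=False)  # uniform row sampling
    return np.sqrt(n / r) * HDA[rows]

rng = np.random.default_rng(1)
n, d, r = 1024, 8, 256
A = rng.standard_normal((n, d))

# fwht agrees with the explicit (unnormalized, Sylvester-ordered) Hadamard matrix.
assert np.allclose(fwht(A), hadamard(n) @ A)

# The sketch approximately preserves orthogonality, as in the FJLT definition:
U, _, _ = np.linalg.svd(A, full_matrices=False)
PU = srht(U, r, rng)
print(np.linalg.norm(np.eye(d) - PU.T @ PU, 2))  # ||I_d - U^T Pi^T Pi U||_2, typically well below 1
```

The final norm is exactly the quantity bounded in the first condition of the $\epsilon$-FJLT definition; Lemma 3 below quantifies how large $r$ must be for this to hold with high probability.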
Using th e sampling matrix f orm alism d escrib ed previously [18, 20, 21, 22], w e will r ep resen t the op eration of r andomly samplin g r ro ws of an n × d matrix A using an r × n linear sampling op erator S T . Let th e matrix Π F J LT = S T H D b e generated u sing the SR HT . 6 The most imp ortant prop erty ab out the distr ibution Π F J LT is that if r is large en ou gh , th en , with high probab ility , Π F J LT generates an ǫ -FJL T. W e summarize this discussion in the follo wing lemma (which is essen tially a combination of Lemmas 3 and 4 f rom [22], restated to fit our notation). Lemma 3. L et Π F J LT ∈ R r × n b e gener ate d using the SRHT as describ e d ab ove and let U ∈ R n × d ( n ≫ d ) b e an (arbitr ary but fixe d) ortho gonal matrix. If r ≥ 14 2 d ln(40 nd ) ǫ 2 ln  30 2 d ln(40 nd ) ǫ 2  , then, with pr ob ability at le ast 0 . 9 , Π F J LT is an ǫ -FJL T for U . 3 Our main algorit hmic results In this section, w e will d escrib e our main resu lts f or computing relativ e-error approximat ions to ev ery statistical lev erage score (see Algorithm 1) as well as additiv e-error appr o ximations to all of the large cross-le verag e scores (see Algorithm 2) of an arbitrary matrix A ∈ R n × d , with n ≫ d . Both algorithms m ak e u se of a “rand omized sk etc h” of A of the form A (Π 1 A ) † Π 2 , w here Π 1 is an ǫ -FJL T and Π 2 is an ǫ -JL T. W e start with a high-lev el d escription of the basic ideas. 3.1 Outline of our basic approac h Recall that our first goal is to app ro ximate, for all i ∈ [ n ], the quantitie s ℓ i =   U ( i )   2 2 =   e T i U   2 2 , (11) 6 Again, when no confusion can arise, we will use Π F J LT to denote a specific SRHT or the distribution on matrices implied by the randomized p ro cess for constructing an SRH T. 8 where e i is a stand ard b asis vec tor. The hard part of compu ting the scores ℓ i according to Eqn. 
(11) is compu ting an orthogonal matrix U spanning the range of A , which takes O ( nd 2 ) time. S ince U U T = AA † , it follo ws that ℓ i =   e T i U U T   2 2 =    e T i AA †    2 2 =    ( AA † ) ( i )    2 2 , (12) where the first equalit y follo w s from th e orthogonalit y of (the columns of ) U . The hard p art of computing the scores ℓ i according to Eqn. (12) is t w o-fold: first, compu ting the pseudoinv erse; and second, p erforming the matrix-matrix multi plication of A and A † . Both of these pro cedur es tak e O ( nd 2 ) time. As w e will see, w e can get aroun d b oth of these b ottlenec ks b y th e ju dicious application of r andom pro jections to Eqn . (12). T o get aroun d th e b ottlenec k of O ( nd 2 ) time due to computing A † in Eqn. (12), we will compute the pseudoinv erse of a “smaller” m atrix that appro ximates A . A necessary cond ition for such a smaller matrix is that it preserv es r ank. So, na ¨ ıv e id eas suc h as uniformly s amp ling r 1 ≪ n rows from A and computing th e p seudoinv erse of th is sampled matrix will not work w ell for an arbitrary A . F or example, this idea will fail (with high p robabilit y) to return a meaningful appro ximation for matrices consisting of n − 1 ident ical r o ws and a sin gle ro w with a non zero comp onent in the direction p erp endicular to that the identic al rows; finding that “outlying” r o w is cr u cial to obtaining a relativ e-error appro ximation. This is where the SRHT en ters, since it preserve s imp ortant structures of A , in particular its rank, by first r otating A to a rand om basis and then uniform ly sampling ro ws from th e rotated matrix (see [22] for more details). More formally , recall th at the S VD of A is U Σ V T and let Π 1 ∈ R r 1 × n b e an ǫ -FJL T for U (using, f or example, the SRHT of Lemma 3 with the appropr iate c hoice for r 1 ). Th en, one could approxi mate the ℓ i ’s of Eqn. 
(12) by
$$\hat\ell_i = \left\|e_i^T A (\Pi_1 A)^\dagger\right\|_2^2, \qquad (13)$$
where we approximated the $n \times d$ matrix $A$ by the $r_1 \times d$ matrix $\Pi_1 A$. Computing $A(\Pi_1 A)^\dagger$ in this way takes $O(ndr_1)$ time, which is not efficient because $r_1 > d$ (from Lemma 3). To get around this bottleneck, recall that we only need the Euclidean norms of the rows of the matrix $A(\Pi_1 A)^\dagger \in \mathbb{R}^{n \times r_1}$. Thus, we can further reduce the dimensionality of this matrix by using an $\epsilon$-JLT to reduce the dimension $r_1 = \Omega(d)$ to $r_2 = O(\ln n)$. Specifically, let $\Pi_2^T \in \mathbb{R}^{r_2 \times r_1}$ be an $\epsilon$-JLT for the rows of $A(\Pi_1 A)^\dagger$ (viewed as $n$ vectors in $\mathbb{R}^{r_1}$) and consider the matrix $\Omega = A(\Pi_1 A)^\dagger \Pi_2$. This $n \times r_2$ matrix $\Omega$ may be viewed as our "randomized sketch" of the rows of $AA^\dagger$. Then, we can compute and return
$$\tilde\ell_i = \left\|e_i^T A (\Pi_1 A)^\dagger \Pi_2\right\|_2^2, \qquad (14)$$
for each $i \in [n]$, which is essentially what Algorithm 1 does. Not surprisingly, the sketch $A(\Pi_1 A)^\dagger \Pi_2$ can be used in other ways: for example, by considering the dot product between two different rows of this randomized sketching matrix (and some additional manipulations), Algorithm 2 approximates the large cross-leverage scores of $A$.

3.2 Approximating all the statistical leverage scores

Our first main result is Algorithm 1, which takes as input an $n \times d$ matrix $A$ and an error parameter $\epsilon \in (0, 1/2]$, and returns as output numbers $\tilde\ell_i$, $i \in [n]$. Although the basic idea to approximate $\|(AA^\dagger)_{(i)}\|_2$ was described in the previous section, we can improve the efficiency of our approach by avoiding the full sketch of the pseudoinverse. In particular, let $\hat A = \Pi_1 A$ and let its SVD be $\hat A = U_{\hat A}\Sigma_{\hat A}V_{\hat A}^T$. Let $R^{-1} = V_{\hat A}\Sigma_{\hat A}^{-1}$ and note that $R^{-1} \in \mathbb{R}^{d \times d}$ is an orthogonalizer

Input: $A \in \mathbb{R}^{n \times d}$ (with SVD $A = U\Sigma V^T$), error parameter $\epsilon \in (0, 1/2]$.
Output: $\tilde\ell_i$, $i \in [n]$.
1.
Let $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ be an $\epsilon$-FJLT for $U$, using Lemma 3 with $r_1 = \Omega\left(\frac{d\ln n}{\epsilon^2}\ln\left(\frac{d\ln n}{\epsilon^2}\right)\right)$.
2. Compute $\Pi_1 A \in \mathbb{R}^{r_1 \times d}$ and its SVD, $\Pi_1 A = U_{\Pi_1 A}\Sigma_{\Pi_1 A}V_{\Pi_1 A}^T$. Let $R^{-1} = V_{\Pi_1 A}\Sigma_{\Pi_1 A}^{-1} \in \mathbb{R}^{d \times d}$. (Alternatively, $R$ could be computed by a QR factorization of $\Pi_1 A$.)
3. View the normalized rows of $AR^{-1} \in \mathbb{R}^{n \times d}$ as $n$ vectors in $\mathbb{R}^d$, and construct $\Pi_2 \in \mathbb{R}^{d \times r_2}$ to be an $\epsilon$-JLT for $n^2$ vectors (the aforementioned $n$ vectors and their $n^2 - n$ pairwise sums), using Lemma 1 with $r_2 = O(\epsilon^{-2}\ln n)$.
4. Construct the matrix product $\Omega = AR^{-1}\Pi_2$.
5. For all $i \in [n]$ compute and return $\tilde\ell_i = \|\Omega_{(i)}\|_2^2$.

Algorithm 1: Approximating the (diagonal) statistical leverage scores $\ell_i$.

for $\hat A$ since $U_{\hat A} = \hat A R^{-1}$ is an orthogonal matrix.⁷ In addition, note that $AR^{-1}$ is approximately orthogonal. Thus, we can compute $AR^{-1}$ and use it as an approximate orthogonal basis for $A$, and then compute $\hat\ell_i$ as the squared row-norms of $AR^{-1}$. The next lemma states that this is exactly what our main algorithm does; even more, we could get the same estimates by using any "orthogonalizer" of $\Pi_1 A$.

Lemma 4. Let $R^{-1}$ be such that $Q = \Pi_1 A R^{-1}$ is an orthogonal matrix with $\mathrm{rank}(Q) = \mathrm{rank}(\Pi_1 A)$. Then, $\left\|(AR^{-1})_{(i)}\right\|_2^2 = \hat\ell_i$.

Proof. Since $\hat A = \Pi_1 A$ has rank $d$ (by Lemma 2) and $R^{-1}$ preserves this rank, $R^{-1}$ is a $d \times d$ invertible matrix. Using $\hat A = QR$ and properties of the pseudoinverse, we get $\hat A^\dagger = R^{-1}Q^T$. Thus,
$$\hat\ell_i = \left\|\left(A(\Pi_1 A)^\dagger\right)_{(i)}\right\|_2^2 = \left\|\left(AR^{-1}Q^T\right)_{(i)}\right\|_2^2 = \left\|\left(AR^{-1}\right)_{(i)}Q^T\right\|_2^2 = \left\|\left(AR^{-1}\right)_{(i)}\right\|_2^2.$$

This lemma says that the $\hat\ell_i$ of Eqn.
(13) can be computed with any QR decomposition, rather than with the SVD; but note that one would still have to post-multiply by $\Pi_2$, as in Algorithm 1, in order to compute "quickly" the approximations of the leverage scores.

⁷ This preprocessing is reminiscent of how [44, 6] preprocessed the input to provide numerical implementations of the fast relative-error algorithm [22] for approximate LS approximation. From this perspective, Algorithm 1 can be viewed as specifying a particular basis $Q$, i.e., as choosing $Q$ to be the left singular vectors of $\Pi_1 A$.

3.3 Approximating the large cross-leverage scores

By combining Lemmas 6 and 7 (in Section 4.2 below) with the triangle inequality, one immediately obtains the following lemma.

Lemma 5. Let $\Omega$ be either the sketching matrix constructed by Algorithm 1, i.e., $\Omega = AR^{-1}\Pi_2$, or $\Omega = A(\Pi_1 A)^\dagger\Pi_2$ as described in Section 3.1. Then, the pairwise dot-products of the rows of $\Omega$ are additive-error approximations to the leverage scores and cross-leverage scores:
$$\left|\left\langle U_{(i)}, U_{(j)}\right\rangle - \left\langle \Omega_{(i)}, \Omega_{(j)}\right\rangle\right| \le \frac{3\epsilon}{1-\epsilon}\left\|U_{(i)}\right\|_2\left\|U_{(j)}\right\|_2.$$

That is, if one were interested in obtaining an approximation to all the cross-leverage scores to within additive error (and thus the diagonal statistical leverage scores to relative error), then the algorithm which first computes $\Omega$ followed by all the pairwise inner products achieves this in time $T(\Omega) + O(n^2 r_2)$, where $T(\Omega)$ is the time to compute $\Omega$ from Section 3.2 and $r_2 = O(\epsilon^{-2}\ln n)$.⁸ The challenge is to avoid the $n^2$ computational complexity, and this can be done if one is interested only in the large cross-leverage scores. Our second main result is provided by Algorithms 2 and 3.
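To make the leverage-score pipeline concrete, the following is a minimal numpy sketch of Algorithm 1, with dense Gaussian projections standing in for the SRHT ($\Pi_1$) and the JLT ($\Pi_2$); this forfeits the $O(nd\ln n)$ speedup but preserves the logic. The sketch sizes `r1` and `r2` below are illustrative choices, not the precise values dictated by Lemmas 1 and 3.

```python
import numpy as np

def approx_leverage_scores(A, r1, r2, rng=None):
    """Sketch of Algorithm 1: estimate all statistical leverage scores of A.

    Dense Gaussian projections stand in for the SRHT (Pi_1) and the JLT
    (Pi_2), so the O(nd log n) speedup is lost but the logic is the same.
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    Pi1 = rng.standard_normal((r1, n)) / np.sqrt(r1)   # epsilon-FJLT surrogate
    # Orthogonalizer R^{-1} = V Sigma^{-1} of Pi1 A (a QR would also do, Lemma 4)
    _, S, Vt = np.linalg.svd(Pi1 @ A, full_matrices=False)
    R_inv = Vt.T / S
    Pi2 = rng.standard_normal((d, r2)) / np.sqrt(r2)   # epsilon-JLT surrogate
    Omega = A @ (R_inv @ Pi2)                          # n x r2 randomized sketch
    return (Omega ** 2).sum(axis=1)                    # squared row norms

# Compare against the exact leverage scores (squared row norms of U).
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 10))
U, _, _ = np.linalg.svd(A, full_matrices=False)
exact = (U ** 2).sum(axis=1)
approx = approx_leverage_scores(A, r1=1000, r2=500, rng=1)
rel_err = np.abs(approx - exact) / exact
```

With these (generous) sketch sizes the maximum relative error over all rows is modest, consistent with the relative-error guarantee of Theorem 1.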
Algorithm 2 takes as input an $n \times d$ matrix $A$, a parameter $\kappa > 1$, and an error parameter $\epsilon \in (0, 1/2]$, and returns as output a subset of $[n] \times [n]$ and estimates $\tilde c_{ij}$ satisfying Theorem 2. The first step of the algorithm is to compute the matrix $\Omega = AR^{-1}\Pi_2$ constructed by Algorithm 1. Then, Algorithm 2 uses Algorithm 3 as a subroutine to compute "heavy hitter" pairs of rows from a matrix.

Input: $A \in \mathbb{R}^{n \times d}$ and parameters $\kappa > 1$, $\epsilon \in (0, 1/2]$.
Output: The set $H$ consisting of pairs $(i, j)$ together with estimates $\tilde c_{ij}$ satisfying Theorem 2.
1. Compute the $n \times r_2$ matrix $\Omega = AR^{-1}\Pi_2$ from Algorithm 1.
2. Use Algorithm 3 with inputs $\Omega$ and $\kappa' = \kappa(1 + 30d\epsilon)$ to obtain the set $H$ containing all the $\kappa'$-heavy pairs of $\Omega$.
3. Return the pairs in $H$ as the $\kappa$-heavy pairs of $A$.

Algorithm 2: Approximating the large (off-diagonal) cross-leverage scores $c_{ij}$.

4 Proofs of our main theorems

4.1 Sketch of the proof of Theorems 1 and 2

We will start by providing a sketch of the proof of Theorems 1 and 2. A detailed proof is provided in the next two subsections. In our analysis, we will condition on the events that $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ is an $\epsilon$-FJLT for $U$ and $\Pi_2 \in \mathbb{R}^{r_1 \times r_2}$ is an $\epsilon$-JLT for $n^2$ points in $\mathbb{R}^{r_1}$. Note that by setting $\delta = 0.1$ in Lemma 1, both events hold with probability at least $0.8$, which is equal to the success probability of Theorems 1 and 2. The algorithm estimates $\tilde\ell_i = \|\tilde u_i\|_2^2$, where $\tilde u_i = e_i^T A(\Pi_1 A)^\dagger\Pi_2$. First,

⁸ The exact algorithm which computes a basis first and then the pairwise inner products requires $O(nd^2 + n^2 d)$ time. Thus, by using the sketch, we can already improve on this running time by a factor of $d/\ln n$.

Input: $X \in \mathbb{R}^{n \times r}$ with rows $x_1, \ldots, x_n$ and a parameter $\kappa > 1$.
Output: $H = \{(i, j), \tilde c_{ij}\}$ containing all heavy (unordered) pairs.
The pair $(i, j), \tilde c_{ij}$ is in $H$ if and only if $\tilde c_{ij}^2 = \langle x_i, x_j\rangle^2 \ge \left\|X^T X\right\|_F^2/\kappa$.
1: Compute the norms $\|x_i\|_2$ and sort the rows according to norm, so that $\|x_1\|_2 \le \cdots \le \|x_n\|_2$.
2: $H \leftarrow \{\}$; $z_1 \leftarrow n$; $z_2 \leftarrow 1$.
3: while $z_2 \le z_1$ do
4:   while $\|x_{z_1}\|_2^2\|x_{z_2}\|_2^2 < \left\|X^T X\right\|_F^2/\kappa$ do
5:     $z_2 \leftarrow z_2 + 1$.
6:     if $z_2 > z_1$ then
7:       return $H$.
8:     end if
9:   end while
10:  for each pair $(i, j)$ where $i = z_1$ and $j \in \{z_2, z_2+1, \ldots, z_1\}$ do
11:    $\tilde c_{ij}^2 = \langle x_i, x_j\rangle^2$.
12:    if $\tilde c_{ij}^2 \ge \left\|X^T X\right\|_F^2/\kappa$ then
13:      add $(i, j)$ and $\tilde c_{ij}$ to $H$.
14:    end if
15:  end for
16:  $z_1 \leftarrow z_1 - 1$.
17: end while
18: return $H$.

Algorithm 3: Computing heavy pairs of a matrix.

observe that the sole purpose of $\Pi_2$ is to improve the running time while preserving pairwise inner products; this is achieved because $\Pi_2$ is an $\epsilon$-JLT for $n^2$ points. So, the results will follow if $e_i^T A(\Pi_1 A)^\dagger((\Pi_1 A)^\dagger)^T A^T e_j \approx e_i^T U U^T e_j$ and $(\Pi_1 A)^\dagger$ can be computed efficiently. Since $\Pi_1$ is an $\epsilon$-FJLT for $U$, where $A = U\Sigma V^T$, $(\Pi_1 A)^\dagger$ can be computed in $O(nd\ln r_1 + r_1 d^2)$ time. By Lemma 2, $(\Pi_1 A)^\dagger = V\Sigma^{-1}(\Pi_1 U)^\dagger$, and so
$$e_i^T A(\Pi_1 A)^\dagger\left((\Pi_1 A)^\dagger\right)^T A^T e_j = e_i^T U(\Pi_1 U)^\dagger\left((\Pi_1 U)^\dagger\right)^T U^T e_j.$$
Since $\Pi_1$ is an $\epsilon$-FJLT for $U$, it follows that $(\Pi_1 U)^\dagger((\Pi_1 U)^\dagger)^T \approx I_d$, i.e., that $\Pi_1 U$ is approximately orthogonal. Theorem 1 follows from this basic idea. However, in order to prove Theorem 2, having a sketch which preserves inner products alone is not sufficient. We also need a fast algorithm to identify the large inner products and to relate these to the actual cross-leverage scores. Indeed, it is possible to efficiently find pairs of rows in a general matrix with large inner products. Combining this with the fact that the inner products are preserved, we obtain Theorem 2.
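Algorithm 3's sort-and-two-pointer scan can be implemented directly. The following Python sketch returns the verified heavy pairs together with their inner products; mapping indices back to the original row order is a bookkeeping detail left implicit in the pseudocode.

```python
import numpy as np

def heavy_pairs(X, kappa):
    """Sketch of Algorithm 3: return all (unordered) heavy pairs of rows of X,
    i.e. pairs with <x_i, x_j>^2 >= ||X^T X||_F^2 / kappa, as a dict mapping
    index pairs (in the original row order) to the inner product.
    """
    n, r = X.shape
    norms2 = (X ** 2).sum(axis=1)
    order = np.argsort(norms2)                 # sort rows by norm, ascending
    Xs, ns2 = X[order], norms2[order]
    thresh = np.linalg.norm(X.T @ X, 'fro') ** 2 / kappa
    H = {}
    z1, z2 = n - 1, 0
    while z2 <= z1:
        # advance z2 until the pair (z1, z2) is norm-heavy
        while ns2[z1] * ns2[z2] < thresh:
            z2 += 1
            if z2 > z1:
                return H
        # every (z1, j) with j in [z2, z1] is norm-heavy: verify which are heavy
        for j in range(z2, z1 + 1):
            c = float(Xs[z1] @ Xs[j])
            if c * c >= thresh:
                H[tuple(sorted((int(order[z1]), int(order[j]))))] = c
        z1 -= 1
    return H

# Two aligned dominant rows form the only heavy pairs in this toy input.
X = np.array([[10.0, 0.0], [10.0, 0.0], [0.0, 0.1], [0.1, 0.0]])
H = heavy_pairs(X, kappa=5.0)
```

Note that, as in the pseudocode, diagonal pairs $(i, i)$ are included whenever they are heavy.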
4.2 Proof of Theorem 1

We condition all our analysis on the events that $\Pi_1 \in \mathbb{R}^{r_1 \times n}$ is an $\epsilon$-FJLT for $U$ and $\Pi_2 \in \mathbb{R}^{r_1 \times r_2}$ is an $\epsilon$-JLT for $n^2$ points in $\mathbb{R}^{r_1}$. Define
$$\hat u_i = e_i^T A(\Pi_1 A)^\dagger \quad \text{and} \quad \tilde u_i = e_i^T A(\Pi_1 A)^\dagger\Pi_2.$$
Then, $\hat\ell_i = \|\hat u_i\|_2^2$ and $\tilde\ell_i = \|\tilde u_i\|_2^2$. The proof will follow from the following two lemmas.

Lemma 6. For $i, j \in [n]$,
$$\left|\left\langle U_{(i)}, U_{(j)}\right\rangle - \left\langle \hat u_i, \hat u_j\right\rangle\right| \le \frac{\epsilon}{1-\epsilon}\left\|U_{(i)}\right\|_2\left\|U_{(j)}\right\|_2. \qquad (15)$$

Lemma 7. For $i, j \in [n]$,
$$\left|\left\langle \hat u_i, \hat u_j\right\rangle - \left\langle \tilde u_i, \tilde u_j\right\rangle\right| \le 2\epsilon\|\hat u_i\|_2\|\hat u_j\|_2. \qquad (16)$$

Lemma 6 states that $\langle \hat u_i, \hat u_j\rangle$ is an additive-error approximation to all the cross-leverage scores ($i \ne j$) and a relative-error approximation for the diagonals ($i = j$). Similarly, Lemma 7 shows that these cross-leverage scores are preserved by $\Pi_2$. Indeed, with $i = j$, from Lemma 6 we have $|\hat\ell_i - \ell_i| \le \frac{\epsilon}{1-\epsilon}\ell_i$, and from Lemma 7 we have $|\hat\ell_i - \tilde\ell_i| \le 2\epsilon\hat\ell_i$. Using the triangle inequality and $\epsilon \le 1/2$:
$$\left|\ell_i - \tilde\ell_i\right| = \left|\ell_i - \hat\ell_i + \hat\ell_i - \tilde\ell_i\right| \le \left|\ell_i - \hat\ell_i\right| + \left|\hat\ell_i - \tilde\ell_i\right| \le \left(\frac{\epsilon}{1-\epsilon} + 2\epsilon\right)\ell_i \le 4\epsilon\ell_i.$$
The theorem follows after rescaling $\epsilon$.

Proof of Lemma 6. Let $A = U\Sigma V^T$. Using this SVD of $A$ and Eqn. (10) in Lemma 2,
$$\left\langle \hat u_i, \hat u_j\right\rangle = e_i^T U\Sigma V^T V\Sigma^{-1}(\Pi_1 U)^\dagger\left((\Pi_1 U)^\dagger\right)^T\Sigma^{-1}V^T V\Sigma U^T e_j = e_i^T U(\Pi_1 U)^\dagger\left((\Pi_1 U)^\dagger\right)^T U^T e_j.$$
By performing standard manipulations, we can now bound $\left|\langle U_{(i)}, U_{(j)}\rangle - \langle \hat u_i, \hat u_j\rangle\right|$:
$$\begin{aligned}
\left|\left\langle U_{(i)}, U_{(j)}\right\rangle - \left\langle \hat u_i, \hat u_j\right\rangle\right| &= \left|e_i^T U U^T e_j - e_i^T U(\Pi_1 U)^\dagger\left((\Pi_1 U)^\dagger\right)^T U^T e_j\right| \\
&= \left|e_i^T U\left(I_d - (\Pi_1 U)^\dagger\left((\Pi_1 U)^\dagger\right)^T\right)U^T e_j\right| \\
&\le \left\|I_d - (\Pi_1 U)^\dagger\left((\Pi_1 U)^\dagger\right)^T\right\|_2\left\|U_{(i)}\right\|_2\left\|U_{(j)}\right\|_2.
\end{aligned}$$
Let the SVD of $\Psi = \Pi_1 U$ be $\Psi = U_\Psi\Sigma_\Psi V_\Psi^T$, where $V_\Psi$ is a full rotation in $d$ dimensions (because $\mathrm{rank}(A) = \mathrm{rank}(\Pi_1 U)$).
Then, $\Psi^\dagger(\Psi^\dagger)^T = V_\Psi\Sigma_\Psi^{-2}V_\Psi^T$. Thus,
$$\begin{aligned}
\left|\left\langle U_{(i)}, U_{(j)}\right\rangle - \left\langle \hat u_i, \hat u_j\right\rangle\right| &\le \left\|I_d - V_\Psi\Sigma_\Psi^{-2}V_\Psi^T\right\|_2\left\|U_{(i)}\right\|_2\left\|U_{(j)}\right\|_2 \\
&= \left\|V_\Psi V_\Psi^T - V_\Psi\Sigma_\Psi^{-2}V_\Psi^T\right\|_2\left\|U_{(i)}\right\|_2\left\|U_{(j)}\right\|_2 \\
&= \left\|I_d - \Sigma_\Psi^{-2}\right\|_2\left\|U_{(i)}\right\|_2\left\|U_{(j)}\right\|_2,
\end{aligned}$$
where we used the fact that $V_\Psi V_\Psi^T = V_\Psi^T V_\Psi = I_d$ and the unitary invariance of the spectral norm. Finally, using Eqn. (9) of Lemma 2, the result follows.

Proof of Lemma 7. Since $\Pi_2$ is an $\epsilon$-JLT for $n^2$ vectors, it preserves the norms of an arbitrary (but fixed) collection of $n^2$ vectors. Let $x_i = \hat u_i/\|\hat u_i\|_2$. Consider the following $n^2$ vectors:
$$x_i \ \text{for } i \in [n], \quad \text{and} \quad x_i + x_j \ \text{for } i, j \in [n],\ i \ne j.$$
By the $\epsilon$-JLT property of $\Pi_2$ and the fact that $\|x_i\|_2 = 1$,
$$1 - \epsilon \le \|x_i\Pi_2\|_2^2 \le 1 + \epsilon \quad \text{for } i \in [n], \ \text{and} \qquad (17)$$
$$(1-\epsilon)\|x_i + x_j\|_2^2 \le \|x_i\Pi_2 + x_j\Pi_2\|_2^2 \le (1+\epsilon)\|x_i + x_j\|_2^2 \quad \text{for } i, j \in [n],\ i \ne j. \qquad (18)$$
Combining Eqns. (17) and (18) after expanding the squares using the identity $\|a+b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 + 2\langle a, b\rangle$, substituting $\|x_i\|_2 = 1$, and after some algebra, we obtain
$$\langle x_i, x_j\rangle - 2\epsilon \le \langle x_i\Pi_2, x_j\Pi_2\rangle \le \langle x_i, x_j\rangle + 2\epsilon.$$
To conclude the proof, multiply throughout by $\|\hat u_i\|_2\|\hat u_j\|_2$ and use the homogeneity of the inner product, together with the linearity of $\Pi_2$, to obtain:
$$\langle \hat u_i, \hat u_j\rangle - 2\epsilon\|\hat u_i\|_2\|\hat u_j\|_2 \le \langle \hat u_i\Pi_2, \hat u_j\Pi_2\rangle \le \langle \hat u_i, \hat u_j\rangle + 2\epsilon\|\hat u_i\|_2\|\hat u_j\|_2.$$

Running Times. By Lemma 4, we can use $V_{\Pi_1 A}\Sigma_{\Pi_1 A}^{-1}$ instead of $(\Pi_1 A)^\dagger$ and obtain the same estimates. Since $\Pi_1$ is an $\epsilon$-FJLT, the product $\Pi_1 A$ can be computed in $O(nd\ln r_1)$ time, while its SVD takes an additional $O(r_1 d^2)$ time to return $V_{\Pi_1 A}\Sigma_{\Pi_1 A}^{-1} \in \mathbb{R}^{d \times d}$. Since $\Pi_2 \in \mathbb{R}^{d \times r_2}$, we obtain $V_{\Pi_1 A}\Sigma_{\Pi_1 A}^{-1}\Pi_2 \in \mathbb{R}^{d \times r_2}$ in an additional $O(r_2 d^2)$ time.
Finally, premultiplying by $A$ takes $O(ndr_2)$ time, and computing and returning the squared row-norms of $\Omega = AV_{\Pi_1 A}\Sigma_{\Pi_1 A}^{-1}\Pi_2 \in \mathbb{R}^{n \times r_2}$ takes $O(nr_2)$ time. So, the total running time is the sum of all these operations, which is
$$O\left(nd\ln r_1 + ndr_2 + r_1 d^2 + r_2 d^2\right).$$
For our implementations of the $\epsilon$-JLTs and $\epsilon$-FJLTs ($\delta = 0.1$), $r_1 = O\left(\epsilon^{-2}d(\ln n)\ln\left(\epsilon^{-2}d\ln n\right)\right)$ and $r_2 = O(\epsilon^{-2}\ln n)$. It follows that the asymptotic running time is
$$O\left(nd\ln\left(d\epsilon^{-1}\right) + nd\epsilon^{-2}\ln n + d^3\epsilon^{-2}(\ln n)\ln\left(d\epsilon^{-1}\right)\right).$$
To simplify, suppose that $d \le n \le e^d$ and treat $\epsilon$ as a constant. Then, the asymptotic running time is
$$O\left(nd\ln n + d^3(\ln n)(\ln d)\right).$$

4.3 Proof of Theorem 2

We first construct an algorithm to estimate the large inner products among the rows of an arbitrary matrix $X \in \mathbb{R}^{n \times r}$ with $n > r$. This general algorithm will be applied to the matrix $\Omega = AV_{\Pi_1 A}\Sigma_{\Pi_1 A}^{-1}\Pi_2$. Let $x_1, \ldots, x_n$ denote the rows of $X$; for a given $\kappa > 1$, the pair $(i, j)$ is heavy if
$$\langle x_i, x_j\rangle^2 \ge \frac{1}{\kappa}\left\|X^T X\right\|_F^2.$$
By the Cauchy-Schwarz inequality, this implies that
$$\|x_i\|_2^2\|x_j\|_2^2 \ge \frac{1}{\kappa}\left\|X^T X\right\|_F^2, \qquad (19)$$
so it suffices to find all the pairs $(i, j)$ for which Eqn. (19) holds. We will call such pairs norm-heavy. Let $s$ be the number of norm-heavy pairs satisfying Eqn. (19). We first bound the number of such pairs.

Lemma 8. Using the above notation, $s \le \kappa r$.

Proof. Observe that
$$\sum_{i,j=1}^n \|x_i\|_2^2\|x_j\|_2^2 = \left(\sum_{i=1}^n \|x_i\|_2^2\right)^2 = \|X\|_F^4 = \left(\sum_{i=1}^r \sigma_i^2\right)^2,$$
where $\sigma_1, \ldots, \sigma_r$ are the singular values of $X$. To conclude, by the definition of a norm-heavy pair,
$$\sum_{i,j} \|x_i\|_2^2\|x_j\|_2^2 \ge \frac{s}{\kappa}\left\|X^T X\right\|_F^2 = \frac{s}{\kappa}\sum_{i=1}^r \sigma_i^4 \ge \frac{s}{\kappa r}\left(\sum_{i=1}^r \sigma_i^2\right)^2,$$
where the last inequality follows by Cauchy-Schwarz.
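A quick numerical sanity check of Lemma 8's bound $s \le \kappa r$, on hypothetical data with a few dominant rows (the count of norm-heavy pairs here is over ordered pairs, including $i = j$, exactly as in the sum in the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, kappa = 300, 5, 10.0
X = rng.standard_normal((n, r))
X[:3] *= 10.0                         # a few dominant rows
norms2 = (X ** 2).sum(axis=1)
thresh = np.linalg.norm(X.T @ X, 'fro') ** 2 / kappa
# count ordered norm-heavy pairs (i, j), including i == j
s = int((np.outer(norms2, norms2) >= thresh).sum())
```

Whatever the input, the count `s` never exceeds `kappa * r`, as the lemma guarantees.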
Algorithm 3 starts by computing the norms $\|x_i\|_2^2$ for all $i \in [n]$ and sorting them (in $O(nr + n\ln n)$ time), so that we can assume that $\|x_1\|_2 \le \cdots \le \|x_n\|_2$. Then, we initialize the set of norm-heavy pairs to $H = \{\}$ and we also initialize two pointers, $z_1 = n$ and $z_2 = 1$. The basic loop in the algorithm checks if $z_2 > z_1$ and stops if that is the case. Otherwise, we increment $z_2$ to the first pair $(z_1, z_2)$ that is norm-heavy. If none of the pairs is norm-heavy (i.e., if $z_2 > z_1$ occurs), then we stop and output $H$; otherwise, we add $(z_1, z_2), (z_1, z_2+1), \ldots, (z_1, z_1)$ to $H$. This basic loop computes all pairs $(z_1, i)$ with $i \le z_1$ that are norm-heavy. Next, we decrease $z_1$ by one, and if $z_1 < z_2$ we stop and output $H$; otherwise, we repeat the basic loop. Note that in the basic loop $z_2$ is always incremented. This occurs whenever the pair $(z_1, z_2)$ is not norm-heavy. Since $z_2$ can be incremented at most $n$ times, the number of times we check whether a pair is norm-heavy and fail is at most $n$. Every successful check results in the addition of at least one norm-heavy pair into $H$, and thus the number of times we check if a pair is norm-heavy (a constant-time operation) is at most $n + s$. The number of pair additions into $H$ is exactly $s$, and thus the total running time is $O(nr + n\ln n + s)$. Finally, we must check each norm-heavy pair to verify whether or not it is actually heavy by computing $s$ inner products of vectors in $\mathbb{R}^r$; this can be done in $O(sr)$ time. Using $s \le \kappa r$ we get the following lemma.

Lemma 9. Algorithm 3 returns $H$ including all the heavy pairs of $X$ in $O(nr + \kappa r^2 + n\ln n)$ time.

To complete the proof, we apply Algorithm 3 with $\Omega = AV_{\Pi_1 A}\Sigma_{\Pi_1 A}^{-1}\Pi_2 \in \mathbb{R}^{n \times r_2}$, where $r_2 = O(\epsilon^{-2}\ln n)$. Let $\tilde u_1, \ldots, \tilde u_n$ denote the rows of $\Omega$ and recall that $A = U\Sigma V^T$. Let $u_1, \ldots$
, $u_n$ denote the rows of $U$; then, from Lemma 5,
$$\langle u_i, u_j\rangle - \frac{3\epsilon}{1-\epsilon}\|u_i\|_2\|u_j\|_2 \le \langle \tilde u_i, \tilde u_j\rangle \le \langle u_i, u_j\rangle + \frac{3\epsilon}{1-\epsilon}\|u_i\|_2\|u_j\|_2. \qquad (20)$$
Given $\epsilon, \kappa$, assume that for the pair of vectors $u_i$ and $u_j$
$$\langle u_i, u_j\rangle^2 \ge \frac{1}{\kappa}\left\|U^T U\right\|_F^2 + 12\epsilon\|u_i\|_2^2\|u_j\|_2^2 = \frac{d}{\kappa} + 12\epsilon\|u_i\|_2^2\|u_j\|_2^2,$$
where the last equality follows from $\left\|U^T U\right\|_F^2 = \|I_d\|_F^2 = d$. By Eqn. (20), after squaring and using $\epsilon < 0.5$,
$$\langle u_i, u_j\rangle^2 - 12\epsilon\|u_i\|_2^2\|u_j\|_2^2 \le \langle \tilde u_i, \tilde u_j\rangle^2 \le \langle u_i, u_j\rangle^2 + 30\epsilon\|u_i\|_2^2\|u_j\|_2^2. \qquad (21)$$
Thus, $\langle \tilde u_i, \tilde u_j\rangle^2 \ge d/\kappa$, and summing Eqn. (21) over all $i, j$ we get $\left\|\Omega^T\Omega\right\|_F^2 \le d + 30\epsilon d^2$, or, equivalently,
$$d \ge \frac{\left\|\Omega^T\Omega\right\|_F^2}{1 + 30d\epsilon}.$$
We conclude that
$$\langle u_i, u_j\rangle^2 \ge \frac{d}{\kappa} + 12\epsilon\|u_i\|_2^2\|u_j\|_2^2 \implies \langle \tilde u_i, \tilde u_j\rangle^2 \ge \frac{d}{\kappa} \ge \frac{\left\|\Omega^T\Omega\right\|_F^2}{\kappa(1 + 30d\epsilon)}. \qquad (22)$$
By construction, Algorithm 3 is invoked with $\kappa' = \kappa\left\|\Omega^T\Omega\right\|_F^2/d$ and thus it finds all pairs with $\langle \tilde u_i, \tilde u_j\rangle^2 \ge \left\|\Omega^T\Omega\right\|_F^2/\kappa' = d/\kappa$. This set contains all pairs for which $\langle u_i, u_j\rangle^2 \ge \frac{d}{\kappa} + 12\epsilon\|u_i\|_2^2\|u_j\|_2^2$. Further, since every pair returned satisfies $\langle \tilde u_i, \tilde u_j\rangle^2 \ge d/\kappa$, by Eqn. (21), $c_{ij}^2 \ge d/\kappa - 30\epsilon\ell_i\ell_j$. This proves the first claim of the theorem; the second claim follows analogously from Eqn. (21). Using Lemma 9, the running time of our approach is $O(nr_2 + \kappa' r_2^2 + n\ln n)$. Since $r_2 = O(\epsilon^{-2}\ln n)$ and, by Eqn. (22), $\kappa' = \kappa\left\|\Omega^T\Omega\right\|_F^2/d \le \kappa(1 + 30d\epsilon)$, the overall running time is
$$O\left(\epsilon^{-2}n\ln n + \epsilon^{-3}\kappa d\ln^2 n\right).$$

5 Extending our algorithm to general matrices

In this section, we will describe an important extension of our main result, namely the computation of the statistical leverage scores relative to the best rank-$k$ approximation to a general matrix $A$.
More specifically, we consider the estimation of leverage scores for the case of general "fat" matrices, namely input matrices $A \in \mathbb{R}^{n \times d}$, where both $n$ and $d$ are large, e.g., when $d = n$ or $d = \Theta(n)$. Clearly, the leverage scores of any full-rank $n \times n$ matrix are exactly uniform. The problem becomes interesting if one specifies a rank parameter $k \ll \min\{n, d\}$. This may arise when the numerical rank of $A$ is small (e.g., in some scientific computing applications, more than 99% of the spectral norm of $A$ may be captured by some $k \ll \min\{n, d\}$ directions), or, more generally, when one is interested in some low-rank approximation to $A$ (e.g., in some data analysis applications, a reasonable fraction or even the majority of the Frobenius norm of $A$ may be captured by some $k \ll \min\{n, d\}$ directions, where $k$ is determined by some exogenously-specified model selection criterion). Thus, assume that in addition to a general $n \times d$ matrix $A$, a rank parameter $k < \min\{n, d\}$ is specified. In this case, we wish to obtain the statistical leverage scores $\ell_i = \left\|(U_k)_{(i)}\right\|_2^2$ for $A_k = U_k\Sigma_k V_k^T$, the best rank-$k$ approximation to $A$. Equivalently, we seek the normalized leverage scores
$$p_i = \frac{\ell_i}{k}. \qquad (23)$$
Note that $\sum_{i=1}^n p_i = 1$ since $\sum_{i=1}^n \ell_i = \|U_k\|_F^2 = k$.

Unfortunately, as stated, this is an ill-posed problem. Indeed, consider the degenerate case when $A = I_n$ (i.e., the $n \times n$ identity matrix). In this case, $U_k$ is not unique and the leverage scores are not well-defined. Moreover, for the obvious $\binom{n}{k}$ equivalent choices for $U_k$, the leverage scores defined according to any one of these choices do not provide a relative-error approximation to the leverage scores defined according to any other choice. More generally, removing this trivial degeneracy does not help. Consider the matrix
$$A = \begin{pmatrix} I_k & 0 \\ 0 & (1-\gamma)I_{n-k} \end{pmatrix} \in \mathbb{R}^{n \times n}.$$
In this example, the leverage scores for $A_k$ are well-defined. However, as $\gamma \to 0$, it is not possible to distinguish between the top-$k$ singular space and its complement. This example suggests that it should be possible to obtain some result conditioned on the spectral gap at the $k$-th singular value. For example, one might assume that $\sigma_k^2 - \sigma_{k+1}^2 \ge \gamma > 0$, in which case the parameter $\gamma$ would play an important role in the ability to solve this problem. Any algorithm which cannot distinguish the singular values with an error less than $\gamma$ will confuse the $k$-th and $(k+1)$-th singular vectors and consequently will fail to get an accurate approximation to the leverage scores for $A_k$.

In the following, we take a more natural approach which leads to a clean problem formulation. To do so, recall that the leverage scores and the related normalized leverage scores of Eqn. (23) are used to approximate the matrix in some way, e.g., we might be seeking a low-rank approximation to the matrix with respect to the spectral [21] or the Frobenius [12] norm, or we might be seeking useful features or data points in downstream data analysis applications [41, 33], or we might be seeking to develop high-quality numerical implementations of low-rank matrix approximation algorithms [27], etc. In all these cases, we only care that the estimated leverage scores are a good approximation to the leverage scores of some "good" low-rank approximation to $A$. The following definition captures the notion of a set of rank-$k$ matrices that are good approximations to $A$.

Definition 2. Given $A \in \mathbb{R}^{n \times d}$ and a rank parameter $k \ll \min\{n, d\}$, let $A_k$ be the best rank-$k$ approximation to $A$. Define the set $S_\epsilon$ of rank-$k$ matrices that are good approximations to $A$ as follows (for $\xi = 2, F$):
$$S_\epsilon = \left\{X \in \mathbb{R}^{n \times d} : \mathrm{rank}(X) = k \ \text{and} \ \|A - X\|_\xi \le (1+\epsilon)\|A - A_k\|_\xi\right\}.$$
(24)

We are now ready to define our approximations to the normalized leverage scores of any matrix $A \in \mathbb{R}^{n \times d}$ given a rank parameter $k \ll \min\{n, d\}$. Instead of seeking to approximate the $p_i$ of Eqn. (23) (a problem that is ill-posed, as discussed above), we will be satisfied if we can approximate the normalized leverage scores of some matrix $X \in S_\epsilon$. This is an interesting relaxation of the task at hand: all matrices $X$ that are sufficiently close to $A_k$ are essentially equivalent, since they can be used instead of $A_k$ in applications.

Definition 3. Given $A \in \mathbb{R}^{n \times d}$ and a rank parameter $k \ll \min\{n, d\}$, let $S_\epsilon$ be the set of matrices of Definition 2. We call the numbers $\hat p_i$ (for all $i \in [n]$) $\beta$-approximations to the normalized leverage scores of $A_k$ (the best rank-$k$ approximation to $A$) if, for some matrix $X \in S_\epsilon$,
$$\hat p_i \ge \beta\frac{\left\|(U_X)_{(i)}\right\|_2^2}{k} \quad \text{and} \quad \sum_{i=1}^n \hat p_i = 1.$$
Here $U_X \in \mathbb{R}^{n \times k}$ is the matrix of the left singular vectors of $X$.

Thus, we will seek algorithms whose output is a set of numbers, with the requirement that those numbers are good approximations to the normalized leverage scores of some matrix $X \in S_\epsilon$ (instead of $A_k$). This removes the ill-posedness of the original problem. Next, we will give two examples of algorithms that compute such $\beta$-approximations to the normalized leverage scores of a general matrix $A$ with a rank parameter $k$ for two popular norms, the spectral norm and the Frobenius norm.⁹

5.1 Leverage Scores for Spectral Norm Approximators

Algorithm 4 approximates the statistical leverage scores of a general matrix $A$ with rank parameter $k$ in the spectral norm case. It takes as inputs a matrix $A \in \mathbb{R}^{n \times d}$ with $\mathrm{rank}(A) = \rho$ and a rank parameter $k \ll \rho$, and outputs a set of numbers $\hat p_i$ for all $i \in [n]$, namely our approximations to the normalized leverage scores of $A$ with rank parameter $k$.
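Algorithm 4 (stated below, together with its analysis) admits a short numpy sketch. For clarity, the leverage scores of the tall matrix $B$ are here computed exactly via an SVD rather than by Algorithm 1, and the number of power iterations `q` is a hand-picked illustrative value rather than the bound of Eqn. (26); the test matrix with a planted rank-$k$ structure is likewise a hypothetical example.

```python
import numpy as np

def rank_k_leverage_spectral(A, k, q, rng=None):
    """Sketch of Algorithm 4 (spectral norm case).

    The leverage scores of the tall matrix B are computed exactly via an
    SVD instead of by Algorithm 1, and q is supplied by the caller instead
    of the bound of Eqn. (26).
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    Pi = rng.standard_normal((d, 2 * k))
    B = A @ Pi
    for _ in range(q):                    # power iteration: B = (A A^T)^q A Pi
        B = A @ (A.T @ B)
    U_B, _, _ = np.linalg.svd(B, full_matrices=False)
    ell = (U_B ** 2).sum(axis=1)          # leverage scores of B
    return ell / ell.sum()                # normalized scores, summing to one

# Hypothetical example: numerical rank ~ k with a strong spectral gap.
rng = np.random.default_rng(0)
n, d, k = 500, 500, 5
U = np.linalg.qr(rng.standard_normal((n, k)))[0]
V = np.linalg.qr(rng.standard_normal((d, k)))[0]
A = U @ np.diag([100.0, 90.0, 80.0, 70.0, 60.0]) @ V.T
A += 1e-3 * rng.standard_normal((n, d))
p_hat = rank_k_leverage_spectral(A, k, q=4, rng=1)
p_true = (U ** 2).sum(axis=1) / k         # normalized scores of the planted U
```

Since $B$ has $2k$ columns, the returned scores are only a $\beta$-approximation with $\beta \approx 1/2$, exactly as in Theorem 3: each $\hat p_i$ is bounded below by roughly half of the corresponding planted score.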
The next lemma argues that there exists a matrix $X \in \mathbb{R}^{n \times d}$ of rank $k$ that is sufficiently close to $A$ (in particular, it is a member of $S_\epsilon$ with constant probability) and, additionally, can be written as $X = BY$, where $Y \in \mathbb{R}^{2k \times d}$ is a matrix of rank $k$. A version of this lemma was essentially proven in [27], but see also [43] for computational details; we will use the version of the lemma that appeared in [10]. (See also the conference version [11], but in the remainder we refer to the technical report version [10] for consistency of numbering.) Note that for our purposes in this section, the computation of $Y$ is not relevant, and we refer the reader to [27, 10] for details.

⁹ Note that we will not compute $S_\epsilon$, but our algorithms will compute a matrix in that set. Moreover, that matrix can be used for high-quality low-rank matrix approximation. See the comments in Section 1.4 for more details.

Input: $A \in \mathbb{R}^{n \times d}$ with $\mathrm{rank}(A) = \rho$ and a rank parameter $k \ll \rho$.
Output: $\hat p_i$, $i \in [n]$.
1. Construct $\Pi \in \mathbb{R}^{d \times 2k}$ with entries drawn in i.i.d. trials from the normal distribution $N(0, 1)$.
2. Compute $B = \left(AA^T\right)^q A\Pi \in \mathbb{R}^{n \times 2k}$, with $q$ as in Eqn. (26).
3. Approximately compute the statistical leverage scores of the "tall" matrix $B$ by calling Algorithm 1 with inputs $B$ and $\epsilon$; let $\hat\ell_i$ (for all $i \in [n]$) be the outputs of Algorithm 1.
4. Return
$$\hat p_i = \frac{\hat\ell_i}{\sum_{j=1}^n \hat\ell_j} \qquad (25)$$
for all $i \in [n]$.

Algorithm 4: Approximating the statistical leverage scores of a general matrix $A$ (spectral norm case).

Lemma 10 (Spectral Sketch). Given $A \in \mathbb{R}^{n \times d}$ of rank $\rho$, a rank parameter $k$ such that $2 \le k < \rho$, and an error parameter $\epsilon$ such that $0 < \epsilon < 1$, let $\Pi \in \mathbb{R}^{d \times 2k}$ be a standard Gaussian matrix (with entries selected in i.i.d. trials from $N(0,1)$).
If B =  AA T  q A Π , wher e q ≥      ln  1 + q k k − 1 + e q 2 k p min { n , d } − k  2 ln (1 + ǫ/ 10) − 1 / 2      , (26) then ther e e xists a matrix X ∈ R n × d of r ank k satisfying X = B Y (with Y ∈ R 2 k × d ) suc h that E [ k A − X k 2 ] ≤  1 + ǫ 10  k A − A k k 2 . The matrix B c an b e c ompute d in O ( ndk q ) time. This v ersion of the abov e lemma is pro v en in [10]. 10 No w, s in ce X has rank k , it follo ws that k A − X k 2 ≥ k A − A k k 2 and th us we can consider the non-negativ e rand om v ariable k A − X k 2 − k A − A k k 2 and app ly Mark ov’ s inequalit y to get that k A − X k 2 − k A − A k k 2 ≤ ǫ k A − A k k 2 holds with prob ab ility at least 0 . 9. Thus, X ∈ S ǫ with pr obabilit y at least 0 . 9. 10 More sp ecifically , t h e pro of may b e found in Lemma 32 and in particular in Eqn. (14) in Section A.2; note that for our p urp oses here we replaced ǫ/ √ 2 by ǫ/ 10 af ter adjusting q accordingly . 18 The next step of the prop osed algorithm is to approximat ely compute the lev erage scores of B ∈ R n × 2 k via Algorithm 1. Und er the assump tions of Theorem 1, this s tep ru ns in O  nk ǫ − 2 ln n  time. Let U X ∈ R n × k b e the matrix con taining the left singular v ectors of th e matrix X of Lemma 10. Then, s in ce X = B Y b y Lemma 10, it follo ws that U B = [ U X U R ] is a basis for the subspace spanned by the columns of B . Here U R ∈ R n × k is an orthogo nal matrix whose columns are p erp endicular to th e columns of U X . No w consider the approximat e leverag e scores ˆ ℓ i computed by Algorithm 1 and note that (by Theorem 1),     ˆ ℓ i −    ( U B ) ( i )    2 2     ≤ ǫ    ( U B ) ( i )    2 2 holds with prob ab ility at least 0 . 8 for all i ∈ [ n ]. It follo ws that n X j =1 ˆ ℓ j ≤ (1 + ǫ ) n X j =1    ( U B ) ( j )    2 2 = (1 + ǫ ) n X j =1 k U B k 2 F = 2 ( 1 + ǫ ) k . 
Finally,
$$\hat p_i = \frac{\hat\ell_i}{\sum_{j=1}^n \hat\ell_j} \ge \frac{(1-\epsilon)\left\|(U_B)_{(i)}\right\|_2^2}{\sum_{j=1}^n \hat\ell_j} = \frac{(1-\epsilon)\left(\left\|(U_X)_{(i)}\right\|_2^2 + \left\|(U_R)_{(i)}\right\|_2^2\right)}{\sum_{j=1}^n \hat\ell_j} \ge \frac{(1-\epsilon)\left\|(U_X)_{(i)}\right\|_2^2}{\sum_{j=1}^n \hat\ell_j} \ge \frac{1-\epsilon}{2(1+\epsilon)}\cdot\frac{\left\|(U_X)_{(i)}\right\|_2^2}{k}.$$
Clearly, $\left\|(U_X)_{(i)}\right\|_2^2/k$ are the normalized leverage scores of the matrix $X$. Recall that $X \in S_\epsilon$ with probability at least $0.9$ and use Definition 3 to conclude that the scores $\hat p_i$ of Eqn. (25) are $\frac{1-\epsilon}{2(1+\epsilon)}$-approximations to the normalized leverage scores of $A$ with rank parameter $k$. The following theorem summarizes the above discussion.

Theorem 3. Given $A \in \mathbb{R}^{n \times d}$, a rank parameter $k$, and an accuracy parameter $\epsilon$, Algorithm 4 computes a set of normalized leverage scores $\hat p_i$ that are $\frac{1-\epsilon}{2(1+\epsilon)}$-approximations to the normalized leverage scores of $A$ with rank parameter $k$ with probability at least $0.7$. The proposed algorithm runs in
$$O\left(\frac{ndk\ln(\min\{n, d\})}{\ln(1+\epsilon)} + nk\epsilon^{-2}\ln n\right)$$
time.

Input: $A \in \mathbb{R}^{n \times d}$ with $\mathrm{rank}(A) = \rho$ and a rank parameter $k \ll \rho$.
Output: $\hat p_i$, $i \in [n]$.
1. Let $r$ be as in Eqn. (28) and construct $\Pi \in \mathbb{R}^{d \times r}$ whose entries are drawn in i.i.d. trials from the normal distribution $N(0, 1)$.
2. Compute $B = A\Pi \in \mathbb{R}^{n \times r}$.
3. Compute a matrix $Q \in \mathbb{R}^{n \times r}$ whose columns form an orthonormal basis for the column space of $B$.
4. Compute the matrix $Q^T A \in \mathbb{R}^{r \times d}$ and its left singular vectors $U_{Q^T A} \in \mathbb{R}^{r \times r}$.
5. Let $U_{Q^T A,k} \in \mathbb{R}^{r \times k}$ denote the top $k$ left singular vectors of the matrix $Q^T A$ (the first $k$ columns of $U_{Q^T A}$) and compute, for all $i \in [n]$,
$$\hat\ell_i = \left\|\left(QU_{Q^T A,k}\right)_{(i)}\right\|_2^2. \qquad (27)$$
6. Return $\hat p_i = \hat\ell_i/k$ for all $i \in [n]$.

Algorithm 5: Approximating the statistical leverage scores of a general matrix $A$ (Frobenius norm case).

5.2 Leverage Scores for Frobenius Norm Approximators
Algorithm 5 approximates the statistical leverage scores of a general matrix $A$ with rank parameter $k$ in the Frobenius norm case. It takes as inputs a matrix $A \in \mathbb{R}^{n \times d}$ with $\mathrm{rank}(A) = \rho$ and a rank parameter $k \ll \rho$, and outputs a set of numbers $\hat p_i$ for all $i \in [n]$, namely our approximations to the normalized leverage scores of $A$ with rank parameter $k$. It is worth noting that $\sum_{i=1}^n \hat\ell_i = \left\|QU_{Q^T A,k}\right\|_F^2 = \left\|U_{Q^T A,k}\right\|_F^2 = k$, and thus the $\hat p_i$ sum up to one.

The next lemma argues that there exists a matrix $X \in \mathbb{R}^{n \times d}$ of rank $k$ that is sufficiently close to $A$ (in particular, it is a member of $S_\epsilon$ with constant probability). Unlike the previous section (the spectral norm case), we will now be able to provide a closed-form formula for this matrix $X$ and, more importantly, the normalized leverage scores of $X$ will be exactly equal to the $\hat p_i$ returned by our algorithm. Thus, in the parlance of Definition 3, we will get a 1-approximation to the normalized leverage scores of $A$ with rank parameter $k$.

Lemma 11 (Frobenius Sketch). Given $A \in \mathbb{R}^{n \times d}$ of rank $\rho$, a rank parameter $k$ such that $2 \le k < \rho$, and an error parameter $\epsilon$ such that $0 < \epsilon < 1$, let $\Pi \in \mathbb{R}^{d \times r}$ be a standard Gaussian matrix (with entries selected in i.i.d. trials from $N(0,1)$) with
$$r \ge k + \left\lceil \frac{10k}{\epsilon} + 1 \right\rceil. \qquad (28)$$
Let $B = A\Pi$ and let $X$ be as in Eqn. (29). Then,
$$\mathrm{E}\left[\|A - X\|_F^2\right] \le \left(1 + \frac{\epsilon}{10}\right)\|A - A_k\|_F^2.$$
The matrix $B$ can be computed in $O\left(ndk\epsilon^{-1}\right)$ time.

Let
$$X = Q\left(Q^T A\right)_k \in \mathbb{R}^{n \times d}, \qquad (29)$$
where $\left(Q^T A\right)_k$ is the best rank-$k$ approximation to the matrix $Q^T A$; from standard linear algebra, $\left(Q^T A\right)_k = U_{Q^T A,k}U_{Q^T A,k}^T Q^T A$. Then, the above lemma is proven in [10].
Now, since $X$ has rank $k$, it follows that $\|A - X\|_F^2 \ge \|A - A_k\|_F^2$, and thus we can consider the non-negative random variable $\|A - X\|_F^2 - \|A - A_k\|_F^2$ and apply Markov's inequality to get that
$$\|A - X\|_F^2 - \|A - A_k\|_F^2 \le \epsilon \|A - A_k\|_F^2$$
holds with probability at least $0.9$. Rearranging terms and taking square roots of both sides implies that
$$\|A - X\|_F \le \sqrt{1+\epsilon}\,\|A - A_k\|_F \le (1+\epsilon)\,\|A - A_k\|_F.$$
Thus, $X \in S_\epsilon$ with probability at least $0.9$.

To conclude our proof, recall that $Q$ is an orthonormal basis for the columns of $B$. From Eqn. (29),
$$X = Q\left(Q^T A\right)_k = Q U_{Q^T A,k} U_{Q^T A,k}^T Q^T A = Q U_{Q^T A,k} \Sigma_{Q^T A,k} V_{Q^T A,k}^T.$$
In the above, $\Sigma_{Q^T A,k} \in \mathbb{R}^{k\times k}$ is the diagonal matrix containing the top $k$ singular values of $Q^T A$, and $V_{Q^T A,k}^T \in \mathbb{R}^{k\times d}$ is the matrix whose rows are the top $k$ right singular vectors of $Q^T A$. Thus, the left singular vectors of the matrix $X$ are exactly equal to the columns of the orthogonal matrix $Q U_{Q^T A,k}$; it now follows that the $\hat{\ell}_i$ of Eqn. (27) are the leverage scores of the matrix $X$ and, finally, that the $\hat{p}_i$ returned by the proposed algorithm are the normalized leverage scores of the matrix $X$.

We briefly discuss the running time of the proposed algorithm. First, we can compute $B$ in $O(ndr)$ time. Then, the computation of $Q$ takes $O(nr^2)$ time. The computation of $Q^T A$ takes $O(ndr)$ time and the computation of $U_{Q^T A}$ takes $O(dr^2)$ time. Thus, the total time is $O\!\left(ndr + (n+d)r^2\right)$. The following theorem summarizes the above discussion.

Theorem 4. Given $A \in \mathbb{R}^{n\times d}$, a rank parameter $k$, and an accuracy parameter $\epsilon$, Algorithm 5 computes a set of normalized leverage scores $\hat{p}_i$ that are $1$-approximations to the normalized leverage scores of $A$ with rank parameter $k$ with probability at least $0.7$. The proposed algorithm runs in $O\!\left(ndk\epsilon^{-1} + (n+d)k^2\epsilon^{-2}\right)$ time.
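Algorithm 5 can be sketched in a few lines of numpy (a minimal illustration, not an optimized implementation: the function name and the demo dimensions are ours, and dense QR/SVD routines stand in for any faster alternatives):

```python
import numpy as np

def frobenius_leverage_scores(A, k, eps, seed=0):
    """Sketch of Algorithm 5: approximate the normalized leverage scores
    of A with rank parameter k (Frobenius norm case)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Step 1: Gaussian sketching matrix with r as in Eqn. (28).
    r = k + int(np.ceil(10 * k / eps + 1))
    Pi = rng.standard_normal((d, r))
    # Steps 2-3: B = A Pi and an orthonormal basis Q for its column space.
    B = A @ Pi
    Q, _ = np.linalg.qr(B)
    # Step 4: left singular vectors of Q^T A.
    U, _, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
    # Step 5: squared row norms of Q U_k, i.e. leverage scores of X = Q (Q^T A)_k.
    ell_hat = np.sum((Q @ U[:, :k]) ** 2, axis=1)
    # Step 6: normalize; the ell_hat sum to k exactly, so the p_hat sum to one.
    return ell_hat / k

A_demo = np.random.default_rng(1).standard_normal((200, 30))
p_hat = frobenius_leverage_scores(A_demo, k=5, eps=0.5)
print(np.isclose(p_hat.sum(), 1.0))  # True: the normalized scores sum to one
```

The work is dominated by forming $B$ and $Q^T A$, matching the $O\!\left(ndr + (n+d)r^2\right)$ count above.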
(We note that the proof of Lemma 11 may be found in Lemma 33 in Section A.3; for our purposes here we set $p = \left\lceil 10k/\epsilon + 1\right\rceil$.)

6 Discussion

We conclude with a discussion of our main results in a broader context: understanding the relationship between our main algorithm and a related estimator for the statistical leverage scores; applying our main algorithm to solve under-constrained least-squares problems; and implementing variants of the basic algorithm in streaming environments.

6.1 A related estimator for the leverage scores

Magdon-Ismail [31] presented the following algorithm to estimate the statistical leverage scores: given as input an $n\times d$ matrix $A$, with $n \gg d$, the algorithm proceeds as follows.

- Compute $\Pi A$, where the $O\!\left(\frac{n\ln d}{\ln^2 n}\right) \times n$ matrix $\Pi$ is an SRHT or another FJLT.
- Compute $X = (\Pi A)^\dagger \Pi$.
- For $t = 1,\ldots,n$, compute the estimate $\tilde{w}_t = A_{(t)}^T X^{(t)}$ and set $w_t = \max\left\{\frac{d\ln^2 n}{4n}, \tilde{w}_t\right\}$.
- Return the quantities $\tilde{p}_i = w_i / \sum_{i'=1}^n w_{i'}$, for $i \in [n]$.

[31] argued that the output $\tilde{p}_i$ achieves an $O(\ln^2 n)$ approximation to all of the (normalized) statistical leverage scores of $A$ in roughly $O(nd^2/\ln n)$ time. (To our knowledge, prior to our work here, this is the only known estimator that obtains any nontrivial provable approximation to the leverage scores of a matrix in $o(nd^2)$ time.)

To see the relationship between this estimator and our main result, recall that $\ell_i = e_i^T U U^T e_i = e_i^T A A^\dagger e_i = x_i^T y_i$, where the vector $x_i^T = e_i^T A$ is cheap to compute and the vector $y_i = A^\dagger e_i$ is expensive to compute. The above algorithm effectively approximates $y_i = A^\dagger e_i$ via a random projection as $\tilde{y}_i = (\Pi A)^\dagger \Pi e_i$, where $\Pi$ is an SRHT or another FJLT.
Since the estimates $x_i^T \tilde{y}_i$ are not necessarily positive, a truncation at the negative tail, followed by a renormalization step, must be performed in order to arrive at the final estimator returned by the algorithm. This truncation-renormalization step has the effect of inflating the estimates of the small leverage scores by an $O(\ln^2 n)$ factor. By way of comparison, Algorithm 1 essentially computes a sketch of $AA^\dagger$ of the form $A(\Pi A)^\dagger \Pi^T$ that maintains positivity for each of the row norm estimates.

Although both Algorithm 1 and the algorithm of this subsection estimate $AA^\dagger$ by a matrix of the form $A(\Pi A)^\dagger \Pi^T$, there are notable differences. The algorithm of this subsection does not actually compute or approximate $AA^\dagger$ directly; instead, it separates the matrix into two parts and computes the dot product between $e_i^T A$ and $(\Pi A)^\dagger \Pi e_i$. Positivity is sacrificed, and this leads to some complications in the estimator; however, the truncation step is interesting since, despite the fact that the estimates are "biased" (in a manner somewhat akin to what is obtained with "thresholding" or "regularization" procedures), we still obtain provable approximation guarantees. The algorithm of this subsection is simpler (since it uses only one random projection), albeit at the cost of weaker theoretical guarantees and a worse running time than our main algorithm. A direction of considerable practical interest is to evaluate empirically the performance of these two estimators, either for estimating all the leverage scores or (more interestingly) for estimating the largest leverage scores for data matrices for which the leverage scores are quite nonuniform.
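The estimator of this subsection can be sketched as follows (a hedged illustration: we substitute a Gaussian projection for the SRHT/FJLT, the sketch size r is a free parameter rather than the prescribed O(n ln d / ln² n), and the function name is ours):

```python
import numpy as np

def one_projection_leverage_estimates(A, r, seed=0):
    """Sketch of the estimator of this subsection: approximate
    ell_i = (e_i^T A) (A^+ e_i) by (e_i^T A) ((Pi A)^+ Pi e_i),
    then truncate at the floor d ln^2(n) / (4n) and renormalize."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    Pi = rng.standard_normal((r, n)) / np.sqrt(r)   # Gaussian stand-in for an SRHT
    X = np.linalg.pinv(Pi @ A) @ Pi                 # d x n, i.e. (Pi A)^+ Pi
    w_tilde = np.einsum('ij,ji->i', A, X)           # w~_t = A_(t) . X^(t)
    w = np.maximum(w_tilde, d * np.log(n) ** 2 / (4 * n))  # truncation step
    return w / w.sum()                              # renormalization step

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 10))
A[0, :] *= 100.0                                    # plant one high-leverage row
p_tilde = one_projection_leverage_estimates(A, r=80)
print(np.isclose(p_tilde.sum(), 1.0))  # True: the estimates are renormalized
```

Note how the floor inflates the many small estimates, which is the source of the $O(\ln^2 n)$ approximation factor discussed above, while guaranteeing positivity.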
6.2 An application to under-constrained least-squares problems

Consider the following under-constrained least-squares problem:
$$\min_{x\in\mathbb{R}^d} \|Ax - b\|_2, \qquad (30)$$
where $A \in \mathbb{R}^{n\times d}$ has many fewer rows than columns, i.e., $n \ll d$. It is well known that we can solve this problem exactly in $O(n^2 d)$ time and that the minimal $\ell_2$-norm solution is given by $x_{opt} = A^\dagger b$. For simplicity, let us assume that the input matrix $A$ has full rank (i.e., $\mathrm{rank}(A) = n$) and thus $\|Ax_{opt} - b\|_2 = 0$.

In this section, we argue that Algorithm 6 computes a simple, accurate estimator $\tilde{x}_{opt}$ for $x_{opt}$. In words, Algorithm 6 samples a small number of columns from $A$ (note that the columns of $A$ correspond to variables in our under-constrained problem) and uses the sampled columns to compute $\tilde{x}_{opt}$. However, in order to determine which columns will be included in the sample, the algorithm makes use of the statistical leverage scores of the matrix $A^T$; more specifically, columns (and thus variables) are chosen with probability proportional to the corresponding statistical leverage score. We state Algorithm 6 assuming that these probabilities are part of the input; the following theorem is our main quality-of-approximation result for Algorithm 6.

Theorem 5. Let $A \in \mathbb{R}^{n\times d}$ be a full-rank matrix with $n \ll d$; let $\epsilon \in (0, 0.5]$ be an accuracy parameter; let $\delta \in (0,1)$ be a failure probability; and let $x_{opt} = A^\dagger b$ be the minimal $\ell_2$-norm solution to the least-squares problem of Eqn. (30). Let $p_i \ge 0$, $i \in [d]$, be a set of probabilities satisfying $\sum_{i=1}^d p_i = 1$ and
$$p_i \ge \frac{\beta \left\|V_{(i)}\right\|_2^2}{n} \qquad (31)$$
for some constant $\beta \in (0,1]$. (Here $V \in \mathbb{R}^{d\times n}$ is the matrix of the right singular vectors of $A$.) If $\tilde{x}_{opt}$ is computed via Algorithm 6 then, with probability at least $1-\delta$,
$$\|x_{opt} - \tilde{x}_{opt}\|_2 \le 2\epsilon \|x_{opt}\|_2.$$
Algorithm 6 runs in $O\!\left(n^3\epsilon^{-2}\beta^{-1}\ln\!\left(\frac{n}{\epsilon\beta\delta}\right) + nd\right)$ time.

Proof: Let the singular value decomposition of the full-rank matrix $A$ be $A = U\Sigma V^T$, with $U \in \mathbb{R}^{n\times n}$, $\Sigma \in \mathbb{R}^{n\times n}$, and $V \in \mathbb{R}^{d\times n}$; note that all the diagonal entries of $\Sigma$ are strictly positive since $A$ has full rank. We can now apply Theorem 4 of Section 6.1 of [22] (with $A = V^T$, noting that $\|V^T\|_F^2 = n \ge 1$, $\|V^T\|_2 = 1$, and $(V^T)^{(i)} = V_{(i)}$) to get
$$\left\|I_n - V^T S S^T V\right\|_2 = \left\|V^T V - V^T S S^T V\right\|_2 \le \epsilon \qquad (32)$$
for our choice of $r$ with probability at least $1-\delta$. Note that $V^T S \in \mathbb{R}^{n\times r}$ (with $r \ge n$) and let $\sigma_i\!\left(V^T S\right)$ denote the singular values of $V^T S$ for all $i \in [n]$; the above inequality implies that for all $i \in [n]$,
$$\left|1 - \sigma_i^2\!\left(V^T S\right)\right| \le \left\|I_n - V^T S S^T V\right\|_2 \le \epsilon \le 0.5.$$
Thus, all the singular values of $V^T S$ are strictly positive and hence $V^T S$ has full rank equal to $n$. Also, using $\epsilon \le 0.5$,
$$\left|1 - \sigma_i^{-2}\!\left(V^T S\right)\right| \le \frac{\epsilon}{1-\epsilon} \le 2\epsilon. \qquad (33)$$
We are now ready to prove our theorem:
$$\begin{aligned}
\|x_{opt} - \tilde{x}_{opt}\|_2 &= \left\|A^T (AS)^{\dagger T} (AS)^\dagger b - A^\dagger b\right\|_2 \\
&= \left\|V\Sigma U^T \left(U\Sigma V^T S\right)^{\dagger T} \left(U\Sigma V^T S\right)^\dagger b - V\Sigma^{-1} U^T b\right\|_2 \\
&= \left\|\Sigma U^T U \Sigma^{-1} \left(V^T S\right)^{\dagger T} \left(V^T S\right)^\dagger \Sigma^{-1} U^T b - \Sigma^{-1} U^T b\right\|_2 \\
&= \left\|\left(V^T S\right)^{\dagger T} \left(V^T S\right)^\dagger \Sigma^{-1} U^T b - \Sigma^{-1} U^T b\right\|_2.
\end{aligned}$$
In the above derivations we substituted the SVD of $A$, dropped terms that do not change unitarily invariant norms, and used the fact that $V^T S$ and $\Sigma$ have full rank in order to simplify the pseudoinverse. Now let $\left(V^T S\right)^{\dagger T}\left(V^T S\right)^\dagger = I_n + E$ and note that Eqn. (33) and the fact that $V^T S$ has full rank imply
$$\|E\|_2 = \left\|I_n - \left(V^T S\right)^{\dagger T}\left(V^T S\right)^\dagger\right\|_2 = \max_{i\in[n]} \left|1 - \sigma_i^{-2}\!\left(V^T S\right)\right| \le 2\epsilon.$$
Thus, we conclude our proof by observing that
$$\|x_{opt} - \tilde{x}_{opt}\|_2 = \left\|(I_n + E)\Sigma^{-1}U^T b - \Sigma^{-1}U^T b\right\|_2 = \left\|E\Sigma^{-1}U^T b\right\|_2 \le \|E\|_2 \left\|\Sigma^{-1}U^T b\right\|_2 \le 2\epsilon\|x_{opt}\|_2.$$
In the above we used the fact that $\|x_{opt}\|_2 = \|A^\dagger b\|_2 = \|V\Sigma^{-1}U^T b\|_2 = \|\Sigma^{-1}U^T b\|_2$. The running time of the algorithm follows by observing that $AS$ is an $n\times r$ matrix and thus computing its pseudoinverse takes $O(n^2 r)$ time; computing $\tilde{x}_{opt}$ takes an additional $O(nr + dn)$ time. ⋄

Input: $A \in \mathbb{R}^{n\times d}$, $b \in \mathbb{R}^n$, error parameter $\epsilon \in (0, 0.5]$, failure probability $\delta$, and a set of probabilities $p_i$ (for all $i \in [d]$) summing to one and satisfying Eqn. (31).
Output: $\tilde{x}_{opt} \in \mathbb{R}^d$.

1. Let $r = \frac{96n}{\beta\epsilon^2}\ln\!\left(\frac{96n}{\beta\epsilon^2\sqrt{\delta}}\right)$.
2. Let $S \in \mathbb{R}^{d\times r}$ be an all-zeros matrix.
3. For $t = 1,\ldots,r$ do
   - Pick $i_t \in [d]$ such that $\Pr(i_t = i) = p_i$.
   - Set $S_{i_t t} = 1/\sqrt{r p_{i_t}}$.
4. Return $\tilde{x}_{opt} = A^T (AS)^{\dagger T}(AS)^\dagger b$.

Algorithm 6: Approximately solving under-constrained least-squares problems.

We conclude the section with a few remarks. First, assuming that $\epsilon$, $\beta$, and $\delta$ are constants and $n\ln n = o(d)$, it immediately follows that Algorithm 6 runs in $o(n^2 d)$ time. It should be clear that we can use Theorem 1 and the related Algorithm 1 to approximate the statistical leverage scores, thus bypassing the need to compute them exactly. Second, instead of approximating the statistical leverage scores needed in Algorithm 6, we could use the randomized Hadamard transform (essentially post-multiplying $A$ by a randomized Hadamard transform to make all statistical leverage scores uniform). The resulting algorithm could be theoretically analyzed following the lines of [22]. It would be interesting to evaluate experimentally the performance of the two approaches on real data.
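Algorithm 6 can be sketched in numpy as follows (a minimal sketch assuming, for illustration, that the exact leverage scores of $A^T$ are supplied as the probabilities, i.e. $\beta = 1$; the function name is ours, and $AS$ is built directly rather than forming the sparse matrix $S$):

```python
import numpy as np

def underconstrained_ls(A, b, p, eps=0.5, beta=1.0, delta=0.1, seed=0):
    """Sketch of Algorithm 6 for min ||Ax - b||_2 with n << d."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Step 1: sample size r as in Algorithm 6.
    c = 96.0 * n / (beta * eps ** 2)
    r = int(np.ceil(c * np.log(c / np.sqrt(delta))))
    # Steps 2-3: sample r column indices i.i.d. from p; instead of forming the
    # d x r sampling matrix S explicitly, build AS = A S column by column.
    idx = rng.choice(d, size=r, p=p)
    AS = A[:, idx] / np.sqrt(r * p[idx])            # n x r sampled, rescaled columns
    # Step 4: x~_opt = A^T (AS)^{+T} (AS)^+ b.
    P = np.linalg.pinv(AS)                          # r x n
    return A.T @ (P.T @ (P @ b))

rng = np.random.default_rng(3)
n, d = 5, 2000
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                      # a consistent system
Vt = np.linalg.svd(A, full_matrices=False)[2]       # rows: right singular vectors
p = np.sum(Vt ** 2, axis=0) / n                     # exact leverage scores of A^T
x_opt = np.linalg.pinv(A) @ b
x_tilde = underconstrained_ls(A, b, p)
rel_err = np.linalg.norm(x_opt - x_tilde) / np.linalg.norm(x_opt)
print(rel_err)  # with probability >= 0.9, at most 2*eps = 1.0 (Theorem 5)
```

In the demo, the column-norm structure of $V$ supplies the exact sampling probabilities ($\beta = 1$); in practice one would substitute the approximate scores from Algorithm 1, as the first remark above suggests.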
6.3 Extension to streaming environments

In this section, we consider the estimation of the leverage scores and of related statistics when the input data set is so large that an appropriate way to view the data is as a data stream [36]. In this context, one is interested in computing statistics of the data stream while making one pass (or occasionally a few additional passes) over the data from external storage and using only a small amount of additional space. For an $n\times d$ matrix $A$, with $n \gg d$, small additional space means that the space complexity depends only logarithmically on the high dimension $n$ and polynomially on the low dimension $d$. When we discuss bits of space, we assume that the entries of $A$ can be discretized to $O(\log n)$-bit integers, though all of our results can be generalized to arbitrary word sizes.

The general strategy behind our algorithms is as follows.

- As the data streams by, compute $TA$, for an appropriate problem-dependent linear sketching matrix $T$, and also compute $\Pi A$, for a random projection matrix $\Pi$.
- After the first pass over the data, compute the matrix $R^{-1}$, as described in Algorithm 1, corresponding to $\Pi A$ (or compute the pseudoinverse of $\Pi A$, or the $R$ matrix from any other QR decomposition of $\Pi A$).
- Compute $TAR^{-1}\Pi_2$, for a random projection matrix $\Pi_2$, such as the one used by Algorithm 1.

With the procedure outlined above, the matrix $T$ is effectively applied to the rows of $AR^{-1}\Pi_2$, i.e., to the sketch of $A$ whose rows have Euclidean norms approximately equal to the row norms of $U$, and pairwise inner products approximately equal to those in $U$. Thus, statistics related to $U$ can be extracted.

Large Leverage Scores.
Given any $n\times d$ matrix $A$ in a streaming setting, it is known how to find the indices of all rows $A_{(i)}$ of $A$ for which $\|A_{(i)}\|_2^2 \ge \tau\|A\|_F^2$, for a parameter $\tau$, and in addition it is known how to compute a $(1+\epsilon)$-approximation to $\|A_{(i)}\|_2^2$ for these large rows. The basic idea is to use the notion of $\ell_2$-sampling on the matrix $A$, namely, to sample random entries $A_{ij}$ with probability $A_{ij}^2/\|A\|_F^2$. A single entry can be sampled from this distribution in a single pass using $O(\epsilon^{-2}\log^3(nd))$ bits of space [35, 5]. More precisely, these references demonstrate that there is a distribution over $O(d\epsilon^{-2}\log^3(nd)) \times n$ matrices $T$ for which, for any fixed matrix $A \in \mathbb{R}^{n\times d}$, there is a procedure which, given $TA$, outputs a sample $(i,j) \in [n]\times[d]$ with probability
$$(1\pm\epsilon)\,\frac{A_{i,j}^2}{\|A\|_F^2} \pm n^{-O(1)}.$$
Technically, these references concern sampling from vectors rather than matrices, so $T(A)$ is a linear operator which treats $A$ as a length-$nd$ vector and applies the algorithm of [35, 5]. However, by simply increasing the number of rows in $T$ by a factor of the small dimension $d$, we can assume $T$ is left matrix multiplication. By considering the marginal along $[n]$, the probability that $i = a$, for any $a \in [n]$, is
$$(1\pm\epsilon)\,\frac{\|U_{(a)}\|_2^2}{\|U\|_F^2} \pm (nd)^{-O(1)}.$$
By the coupon collector problem, running $O(\tau^{-1}\log\tau^{-1})$ independent copies is enough to find a set containing all rows $A_{(i)}$ for which $\|A_{(i)}\|_2^2 \ge \tau\|A\|_F^2$, and no rows $A_{(i)}$ for which $\|A_{(i)}\|_2^2 < \frac{\tau}{2}\|A\|_F^2$, with probability at least $0.99$. When applied to our setting, we can apply a random projection matrix $\Pi$ and a linear sketching matrix $T$ which has $O(d\tau^{-1}\epsilon^{-2}\log^3(n)\log\tau^{-1})$ rows in the following manner.
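The $\ell_2$-sampling primitive itself can be illustrated offline (a minimal sketch: exact sampling stands in for the small-space streaming sketches of [35, 5], and the matrix dimensions are arbitrary):

```python
import numpy as np

def l2_sample(A, rng):
    """Draw a single entry (i, j) with probability A_ij^2 / ||A||_F^2.
    This is the exact, offline distribution; the streaming algorithms of
    [35, 5] realize it approximately from a sketch T A in small space."""
    p = (A ** 2).ravel() / np.sum(A ** 2)           # flatten A to a length-nd vector
    flat = rng.choice(A.size, p=p)
    return np.unravel_index(flat, A.shape)          # back to (row, column) indices

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
A[7, :] *= 100.0                      # one row dominating the squared Frobenius norm
rows = [l2_sample(A, rng)[0] for _ in range(300)]
# The marginal over row indices is proportional to the squared row norms,
# so the heavy row is sampled far more often than any other.
print(rows.count(7) > 150)
```

Applied to the sketch $AR^{-1}\Pi_2$ instead of $A$, the same row marginal is (approximately) the leverage-score distribution.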
First, $TA$ and $\Pi A$ are computed in the first pass over the data; then, at the end of the first pass, we compute $R^{-1}$; and finally, we compute $TAR^{-1}\Pi_2$, for a random projection matrix $\Pi_2$. (In the offline setting, one would use an SRHT or another FJLT for $\Pi$, while in the streaming setting one could use either of the following: if the stream is such that one sees each entire column of $A$ at once, then one could apply an FJLT to the column; alternatively, if one sees updates to the individual entries of $A$ in an arbitrary order, then one could apply any sketching matrix, such as those of [1] or of [17].) This procedure effectively applies the matrix $T$ to the rows of $AR^{-1}\Pi_2$, which have norms equal to the row norms of $U$, up to a factor of $1+\epsilon$. The multiplication at the end by $\Pi_2$ serves only to speed up the time for processing $TAR^{-1}$. Thus, by the results of [35, 5], we can find all the leverage scores $\|U_{(i)}\|_2^2$ that are of magnitude at least $\tau\|U\|_F^2$ in small space and a single pass over the data. By increasing the space by a factor of $O(\epsilon^{-2}\log n)$, we can also use the $\ell_2$-samples to estimate the norms $\|U_{(i)}\|_2^2$ for the row indices $i$ that we find.

Entropy. Given a distribution $\rho$, a statistic of interest is its entropy, defined as $H(\rho) = \sum_i \rho(i)\log_2(1/\rho(i))$. This statistic can be approximated in a streaming setting. Indeed, it is known that estimating $H(\rho)$ up to an additive $\epsilon$ can be reduced to $(1+\tilde{\epsilon})$-approximation of the $\ell_p$-norm of the vector $(\rho(1),\ldots,\rho(n))$, for $O(\log 1/\epsilon)$ different $p \in (0,1)$ [28]. Here $\tilde{\epsilon} = \epsilon/(\log^3(1/\epsilon)\cdot\log n)$. When applied to our setting, the distribution of interest is $\rho(i) = \frac{1}{d}\|U_{(i)}\|_2^2$.
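As a point of reference, this distribution and its entropy can be computed exactly offline (a minimal sketch assuming $\mathrm{rank}(A) = d$; the function name is ours):

```python
import numpy as np

def leverage_entropy(A):
    """Exact, offline entropy H(rho) of the leverage-score distribution
    rho(i) = ||U_(i)||_2^2 / d (assumes A has full column rank d), i.e.
    the quantity that the streaming procedure below estimates."""
    U = np.linalg.svd(A, full_matrices=False)[0]    # n x d left singular vectors
    rho = np.sum(U ** 2, axis=1) / U.shape[1]       # sums to one when rank(A) = d
    rho = rho[rho > 0]                              # convention: 0 * log(1/0) = 0
    return float(np.sum(rho * np.log2(1.0 / rho)))

A = np.random.default_rng(0).standard_normal((256, 8))
H = leverage_entropy(A)
print(0.0 <= H <= np.log2(256))                     # True: 0 <= H(rho) <= log2(n)
```

The entropy is a natural one-number summary of how nonuniform the leverage scores are: it is $\log_2 n$ for perfectly uniform scores and small when a few rows dominate.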
To compute the entropy of this distribution, there exist sketching matrices $T$ providing $(1+\epsilon)$-approximations to the quantity $F_p(F_2)$ of an $n\times d$ matrix $A$, where $F_p(F_2)$ is defined as $\sum_{i=1}^n \|A_{(i)}\|_2^{2p}$, using $O(\epsilon^{-4}\log^2 n\log 1/\epsilon)$ bits of space (see Theorem 1 of [23]). Thus, to compute the entropy of the leverage score distribution, we can do the following. First, maintain $TA$ and $\Pi A$ in the first pass over the data, where $T$ is a sketching matrix for $F_p(F_2)$, $p \in (0,1)$. At the end of the first pass, compute $R^{-1}$; and finally, compute $TAR^{-1}\Pi_2$, which effectively applies the $F_p(F_2)$-estimation matrix $T$ to the rows of the matrix $AR^{-1}\Pi_2$. Therefore, by the results of [28, 23], we can compute an estimate $\phi$ which is within an additive $\epsilon$ of $H(\rho)$ using $O(d\epsilon^{-4}\log^6 n\log^{14}(1/\epsilon))$ bits of space and a single pass. We note that it is also possible to estimate $H(\rho)$ up to a multiplicative $1+\epsilon$ factor using small, but more, space; see, e.g., [28].

Sampling Row Identities. Another natural problem is that of obtaining samples of rows of $A$ proportional to their leverage score importance sampling probabilities. To do so, we use $\ell_2$-sampling [35, 5] as above for finding the large leverage scores. First, compute $TA$ and $\Pi A$ in the first pass over the data stream; then, compute $R^{-1}$; and finally, compute $TAR^{-1}$. Thus, by applying the procedures of [5] a total of $s$ times independently, we obtain $s$ samples $i_1,\ldots,i_s$, with replacement, of rows of $A$ proportional to $\|U_{(i_1)}\|_2^2,\ldots,\|U_{(i_s)}\|_2^2$, i.e., to their leverage scores. The algorithm requires $O(sd\epsilon^{-2}\log^4 n)$ bits of space and runs in a single pass. To obtain more than just the row identities $i_1,\ldots,i_s$, e.g., to obtain the actual sampled rows, one can read off these rows from $A$ in a second pass over the matrix.

References

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 557–563, 2006.
[3] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
[4] N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1–9, 2008.
[5] A. Andoni, R. Krauthgamer, and K. Onak. Streaming algorithms from precision sampling. Technical report. Preprint: arXiv:1011.1263 (2010).
[6] H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging LAPACK's least-squares solver. SIAM Journal on Scientific Computing, 32:1217–1236, 2010.
[7] C. Bekas, A. Curioni, and I. Fedulova. Low cost high performance uncertainty quantification. In Proceedings of the 2nd Workshop on High Performance Computational Finance, Article No. 8, 2009.
[8] C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Applied Numerical Mathematics, 57:1214–1229, 2007.
[9] P. Bonacich. Power and centrality: A family of measures. The American Journal of Sociology, 92(5):1170–1182, 1987.
[10] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. Technical report. Preprint: arXiv:1103.0995 (2011).
[11] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. In Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science, pages 305–314, 2011.
[12] C. Boutsidis, M.W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 968–977, 2009.
[13] E.J. Candes and B. Recht. Exact matrix completion via convex optimization. Technical report. Preprint: arXiv:0805.4471 (2008).
[14] S. Chatterjee and A.S. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3):379–393, 1986.
[15] S. Chatterjee and A.S. Hadi. Sensitivity Analysis in Linear Regression. John Wiley & Sons, New York, 1988.
[16] S. Chatterjee, A.S. Hadi, and B. Price. Regression Analysis by Example. John Wiley & Sons, New York, 2000.
[17] A. Dasgupta, R. Kumar, and T. Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd Annual ACM Symposium on Theory of Computing, pages 341–350, 2010.
[18] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36:132–157, 2006.
[19] P. Drineas, J. Lewis, and P. Paschou. Inferring geographic coordinates of origin for Europeans using small panels of ancestry informative markers. PLoS ONE, 5(8):e11892, 2010.
[20] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Sampling algorithms for $\ell_2$ regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127–1136, 2006.
[21] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.
[22] P. Drineas, M.W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2010.
[23] S. Ganguly, M. Bansal, and S. Dube. Estimating hybrid frequency moments of data streams. In Proceedings of the 2nd Annual International Workshop on Frontiers in Algorithmics, pages 55–66, 2008.
[24] S. Georgiev and S. Mukherjee. Unpublished results, 2011.
[25] A. Gittens and M.W. Mahoney. In preparation, 2012.
[26] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1996.
[27] N. Halko, P.-G. Martinsson, and J.A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[28] N.J.A. Harvey, J. Nelson, and K. Onak. Sketching and streaming entropy via approximation theory. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, pages 489–498, 2008.
[29] D.C. Hoaglin and R.E. Welsch. The hat matrix in regression and ANOVA. The American Statistician, 32(1):17–22, 1978.
[30] E.A. Jonckheere, M. Lou, J. Hespanha, and P. Barooah. Effective resistance of Gromov-hyperbolic graphs: Application to asymptotic sensor network problems. In Proceedings of the 46th IEEE Conference on Decision and Control, pages 1453–1458, 2007.
[31] M. Magdon-Ismail. Row sampling for matrix algorithms via a non-commutative Bernstein bound. Technical report. Preprint: arXiv:1008.0587 (2010).
[32] M.W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at arXiv:1104.5557.
[33] M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA, 106:697–702, 2009.
[34] X. Meng, M.A. Saunders, and M.W. Mahoney. LSRN: A parallel iterative solver for strongly over- or under-determined systems. Technical report. Preprint: arXiv:1109.5981 (2011).
[35] M. Monemizadeh and D.P. Woodruff. 1-pass relative-error $\ell_p$-sampling with applications. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1143–1160, 2010.
[36] S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. Now Publishers Inc, Boston, 2005.
[37] M.Z. Nashed, editor. Generalized Inverses and Applications. Academic Press, New York, 1976.
[38] M.E.J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27:39–54, 2005.
[39] C.H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: a probabilistic analysis. Journal of Computer and System Sciences, 61(2):217–235, 2000.
[40] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, doi:10.1136/jmg.2010.078212, 2010.
[41] P. Paschou, E. Ziv, E.G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M.W. Mahoney, and P. Drineas. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3:1672–1686, 2007.
[42] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In Proceedings of the 8th Annual ACM SIGKDD Conference, pages 61–70, 2002.
[43] V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009.
[44] V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proc. Natl. Acad. Sci. USA, 105(36):13212–13217, 2008.
[45] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 143–152, 2006.
[46] A. Talwalkar and A. Rostamizadeh. Matrix coherence and the Nyström method. In Proceedings of the 26th Conference in Uncertainty in Artificial Intelligence, 2010.
[47] P.F. Velleman and R.E. Welsch. Efficient computing of regression diagnostics. The American Statistician, 35(4):234–242, 1981.
