Randomized CP Tensor Decomposition
Authors: N. Benjamin Erichson, Krithika Manohar, Steven L. Brunton, J. Nathan Kutz
N. Benjamin Erichson (erichson@berkeley.edu), Department of Statistics, University of California, Berkeley, USA
Krithika Manohar (kmanohar@caltech.edu), Department of Applied & Computational Mathematics, California Institute of Technology, Pasadena, USA
Steven L. Brunton (sbrunton@uw.edu), Department of Mechanical Engineering, University of Washington, Seattle, USA
J. Nathan Kutz (kutz@uw.edu), Department of Applied Mathematics, University of Washington, Seattle, USA

Abstract

The CANDECOMP/PARAFAC (CP) tensor decomposition is a popular dimensionality-reduction method for multiway data. Dimensionality reduction is often sought, since many high-dimensional tensors have low intrinsic rank relative to the dimension of the ambient measurement space. However, the emergence of 'big data' poses significant computational challenges for computing this fundamental tensor decomposition. By leveraging modern randomized algorithms, we demonstrate that coherent structures can be learned from a smaller representation of the tensor in a fraction of the time. Thus, this simple but powerful algorithm enables one to compute the approximate CP decomposition even for massive tensors. The approximation error can thereby be controlled via oversampling and the computation of power iterations. In addition to theoretical results, several empirical results demonstrate the performance of the proposed algorithm.

Keywords: randomized algorithms; randomized least squares; dimension reduction; multilinear algebra; CP decomposition; canonical polyadic tensor decomposition.

1. Introduction

Advances in data acquisition and storage technology have enabled the collection of massive amounts of data in a wide range of emerging applications.
In particular, numerous applications across the physical, biological, social, and engineering sciences generate large multidimensional, multi-relational, and/or multi-modal data. Efficient analysis of this data requires dimensionality reduction techniques. However, traditionally employed matrix decomposition techniques such as the singular value decomposition (SVD) and principal component analysis (PCA) can become inadequate when dealing with multidimensional data. This is because reshaping multi-modal data into matrices, or data flattening, can fail to reveal important structures in the data.

Tensor decompositions overcome the information loss from flattening. The canonical CP tensor decomposition expresses an $N$-way tensor as a sum of rank-one tensors to extract multi-modal structure. It is particularly suitable for data-driven discovery, as shown by Hong et al. (2020) for various learning tasks on real-world data. However, tensor decompositions of massive multidimensional data pose a tremendous computational challenge. Hence, innovations that reduce the computational demands have become increasingly relevant in this field. The idea of tensor compression, for instance, eases computational bottlenecks by constructing smaller (compressed) tensors, which are then used as a proxy to efficiently compute CP decompositions. Compressed tensors may be obtained, for example, using the Tucker decomposition of a tensor into one small core tensor and $N$ unitary matrices. However, this approach requires the expensive computation of the left singular vectors for each mode.

Related Work. This computational challenge can be tackled using modern randomized techniques developed to compute the SVD. Tsourakakis (2010) presents a randomized Tucker decomposition algorithm based on the random sparsification idea of Achlioptas and McSherry (2007) for computing the SVD. Zhou et al. (2014) proposed an accelerated randomized CP algorithm using the ideas of Halko et al. (2011), without the use of power iterations. An alternative randomized tensor algorithm, proposed by Drineas and Mahoney (2007), uses random column selection. Related work by Vervliet et al. (2014) proposed a sparsity-promoting algorithm for incomplete tensors using compressed sensing. Another efficient approach to building large-scale tensor decompositions is the subdivision of a tensor into a set of blocks. These smaller blocks can then be used to approximate the CP decomposition of the full tensor in a parallelized or distributed fashion (Phan and Cichocki, 2011). Sidiropoulos et al. (2014) fused random projection and blocking into a highly computationally efficient algorithm. More recently, Vervliet and Lathauwer (2016) proposed a block sampling CP decomposition method for the analysis of large-scale tensors using randomization, demonstrating significant computational savings while attaining near-optimal accuracy. These block-based algorithms are particularly relevant if the tensor is too large to fit into fast memory.

Contributions. We present the randomized CP algorithm, which is closely related to the randomized methods of Halko et al. (2011). Our method proceeds in two stages. The first stage uses random projections with power iterations to obtain a compressed tensor (Algorithm 1). The second stage applies either alternating least squares (ALS) or block coordinate descent (BCD) to the compressed tensor (Algorithms 2 and 3), at significantly reduced cost. Finally, the CP decomposition of the original tensor is obtained by projecting the compressed factor matrices back using the basis derived in Algorithm 1. Randomized algorithms for CP decomposition have been proposed (Zhou et al., 2014; Battaglino et al., 2018), albeit without incorporating power iterations.
The power iteration concept is fundamental for modern high-performance randomized matrix decompositions but, to the best of our knowledge, has not been applied in the context of tensors. Without power iterations, the performance of randomized algorithms based on random projections can suffer considerably in the presence of white noise. We combine power iterations with oversampling to further control the error of the decomposition. Embedding the CP decomposition into this probabilistic framework allows us to achieve significant computational savings, while producing near-optimal approximation quality. For motivation, Figure 1 shows the computational savings of a rank $k = 20$ approximation for varying tensor dimensions. Here we compare the ALS and BCD solvers for computing the deterministic and randomized CP decompositions. The computational advantage of the randomized algorithms is significant, while sacrificing a negligible amount of accuracy.

Figure 1: Algorithm run times and speedups for varying tensor dimensions for a rank $k = 20$ approximation. Speedups rise sharply with increasing dimensions. (Panels compare run time in seconds and speedup for CP (ALS), CP (BCD), Randomized CP (ALS), and Randomized CP (BCD) on tensors of dimension $100^3$ through $600^3$ and $100^4$.)

Organization. The paper is organized as follows: Section 2 briefly reviews the CP decomposition and randomized matrix decompositions. Section 3 introduces the randomized CP tensor decomposition algorithm. Section 4 presents the evaluation of the computational performance, and examples. Finally, Section 5 summarizes the research findings.

2. Background

Ideas for multi-way factor analysis emerged in the 1920s with the formulation of the polyadic decomposition by Hitchcock (1927).
However, the polyadic tensor decomposition only achieved popularity much later, in the 1970s, with the canonical decomposition (CANDECOMP) in psychometrics, proposed by Carroll and Chang (1970). Concurrently, the method of parallel factors (PARAFAC) was introduced in chemometrics by Harshman (1970). Hence, this method became known as the CP (CANDECOMP/PARAFAC) decomposition. In the past, however, computation was severely inhibited by available computational power. Today, tensor decompositions enjoy increasing popularity, yet run-time bottlenecks persist.

2.1 Notation

Scalars are denoted by lowercase letters $x$, vectors by bold lowercase letters $\mathbf{x}$, and matrices by bold capitals $\mathbf{X}$. Tensors are denoted by calligraphic letters $\mathcal{X}$. The mode-$n$ unfolding of a tensor is expressed as $\mathbf{X}_{(n)}$, while the mode-$n$ folding of a matrix $\mathbf{X}_{(n)}$ recovers the tensor. The vector outer product, the Kronecker product, the Khatri-Rao product, and the Hadamard product are denoted by $\circ$, $\otimes$, $\odot$, and $\ast$, respectively. The inner product of two tensors is expressed as $\langle \cdot, \cdot \rangle$, and $\| \cdot \|_F$ denotes the Frobenius norm for both matrices and tensors.

2.2 CP Decomposition

The CP decomposition is the tensor equivalent of the SVD, since it approximates a tensor by a sum of rank-one tensors. Here, tensor rank is defined as the smallest number of rank-one tensors required to generate the tensor (Hitchcock, 1927). The CP decomposition approximates these rank-one tensors. Given a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, the rank-$R$ CP decomposition is expressed as
$$\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, \quad (1)$$
where $\circ$ denotes the outer product. Specifically, each rank-one tensor is formulated as the outer product of the rank-one components $\mathbf{a}_r \in \mathbb{R}^{I}$, $\mathbf{b}_r \in \mathbb{R}^{J}$, and $\mathbf{c}_r \in \mathbb{R}^{K}$. Components are often constrained to unit length, with the weights absorbed into the vector $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_R] \in \mathbb{R}^{R}$.
Equation (1) can then be re-expressed as (see Figure 2)
$$\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \cdot \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r. \quad (2)$$
More compactly, the components can be expressed as factor matrices, i.e., $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R]$, $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R]$, and $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_R]$. Using the Kruskal operator as defined by Kolda and Bader (2009), (2) can be more compactly expressed as
$$\mathcal{X} \approx [\![ \boldsymbol{\lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!].$$

Figure 2: Schematic of the CP decomposition.

2.3 Randomized Matrix Algorithms

The efficient computation of low-rank matrix approximations is a ubiquitous problem in machine learning and data mining. Randomized matrix algorithms have been demonstrated to be highly competitive and robust when compared to traditional deterministic methods. Randomized algorithms aim to construct a smaller matrix (henceforth called a sketch) designed to capture the essential information of the source matrix (Mahoney, 2011; Drineas and Mahoney, 2016). The sketch can then be used for various learning tasks. There exist several strategies for obtaining such a sketch, with random projections being the most robust off-the-shelf approach. Randomized algorithms have in particular been studied for computing the near-optimal low-rank SVD (Frieze et al., 2004; Liberty et al., 2007; Woolfe et al., 2008; Martinsson et al., 2011). Following the seminal work by Halko et al. (2011), a randomized algorithm computes the low-rank matrix approximation
$$\mathbf{A} \approx \mathbf{Q}\mathbf{B}, \qquad \mathbf{A} \in \mathbb{R}^{m \times n}, \; \mathbf{Q} \in \mathbb{R}^{m \times k}, \; \mathbf{B} \in \mathbb{R}^{k \times n},$$
where the target rank is denoted by $k$ and assumed to satisfy $k \ll \min\{m, n\}$. Intuitively, we construct a sketch $\mathbf{Y} \in \mathbb{R}^{m \times k}$ by forming a randomly weighted linear combination of the columns of the source matrix $\mathbf{A}$. More concisely, we construct the sketch as
$$\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega}, \quad (3)$$
where $\boldsymbol{\Omega} \in \mathbb{R}^{n \times k}$ is a random test matrix with entries drawn from, for example, a Gaussian distribution.
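As a concrete illustration of the sketching step in (3), the following minimal NumPy sketch forms $\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega}$ for a matrix with exact rank $k$ and checks that the orthonormalized sketch captures the column space (all sizes and the synthetic test matrix are our own illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 100 x 60 matrix A with exact rank 5 (illustrative sizes).
m, n, k = 100, 60, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

# Sketch Y = A @ Omega with a Gaussian test matrix Omega (Eq. 3).
Omega = rng.standard_normal((n, k))
Y = A @ Omega

# Orthonormal basis Q for the sketch; A is then approximated by Q @ (Q.T @ A).
Q, _ = np.linalg.qr(Y)
B = Q.T @ A

err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
print(f"relative error: {err:.2e}")  # near machine precision for exact rank k
```

For a matrix with exact rank $k$, the sketch spans the full column space with probability one, so the reconstruction error is at the level of floating-point round-off; the oversampling discussed next matters precisely when the rank is not exact.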
If $\mathbf{A}$ has exact rank $k$, then the sketch $\mathbf{Y}$ spans, with high probability, a basis for the column space of the source matrix. However, most data matrices do not have exact rank (i.e., the singular values $\{\sigma_i\}_{i=k+1}^{n}$ are non-zero). Thus, instead of using just $k$ samples, it is favorable to construct a slightly oversampled sketch $\mathbf{Y} \in \mathbb{R}^{m \times l}$ with $l = k + p$ columns, where $p$ denotes the number of additional samples. In most situations, small values $p = \{10, 20\}$ are sufficient to obtain a good basis that is comparable to the best possible basis (Martinsson, 2016). An orthonormal basis $\mathbf{Q} \in \mathbb{R}^{m \times l}$ is then obtained via the QR decomposition $\mathbf{Y} = \mathbf{Q}\mathbf{R}$, such that $\mathbf{A} \approx \mathbf{Q}\mathbf{Q}^\top\mathbf{A}$ is satisfied. Finally, the source matrix $\mathbf{A}$ is projected to the low-dimensional space
$$\mathbf{B} = \mathbf{Q}^\top\mathbf{A}, \quad (4)$$
where $\mathbf{B} \in \mathbb{R}^{l \times n}$. (Note that Eq. (4) requires a second pass over the source matrix.) The matrix $\mathbf{B}$ can then be used to efficiently compute the matrix decomposition of interest, e.g., the SVD. The approximation error can be controlled by a combination of oversampling and power iterations (Rokhlin et al., 2009; Halko et al., 2011; Gu, 2015). Randomized matrix algorithms are not only pass-efficient, but they also have the ability to exploit modern parallelized and distributed computing architectures. Implementations in MATLAB, C, and R are provided by Szlam et al. (2014), Voronin and Martinsson (2015), and Erichson et al. (2016).

3. Randomized CP Decomposition

Given a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, the objective of the CP decomposition is to find a set of $R$ normalized rank-one tensors $\{\mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r\}_{r=1}^{R}$ which best approximates $\mathcal{X}$, i.e., minimizes the Frobenius norm:
$$\underset{\hat{\mathcal{X}}}{\text{minimize}} \;\; \| \mathcal{X} - \hat{\mathcal{X}} \|_F^2 \quad \text{subject to} \quad \hat{\mathcal{X}} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r. \quad (5)$$
The challenge is that this problem is highly nonconvex and, unlike PCA, has no closed-form solution.
Solution methods for this optimization problem therefore rely on iterative schemes (e.g., alternating least squares). These solvers are not guaranteed to find a global minimum, yet for many practical problems they can find high-quality solutions. Iterative schemes, however, can be computationally demanding if the dimensions of $\mathcal{X}$ are large. Fortunately, only the column spaces of the modes are of importance for obtaining the factor matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, rather than the individual columns of the mode matricizations $\mathbf{X}_{(1)}$, $\mathbf{X}_{(2)}$, $\mathbf{X}_{(3)}$ of the tensor $\mathcal{X}$. This is because the CP decomposition learns the components based on proportional variations in inter-point distances between the components. Therefore, a compressed tensor $\mathcal{B} \in \mathbb{R}^{k \times k \times k}$, with $k \geq R$, must preserve pairwise Euclidean distances. This in turn requires that column spaces, and thus pairwise distances, are approximately preserved by compression; this can be achieved by generalizing the concepts of randomized matrix algorithms to tensors. We build upon the methods introduced by Martinsson et al. (2011) and Halko et al. (2011), as well as related work on randomized tensors by Drineas and Mahoney (2007), who proposed a randomized algorithm based on random column selection.

Figure 3 shows a schematic of the randomized CP decomposition architecture. Our approach proceeds in two stages. The first stage, detailed in Section 3.1, applies random projections with power iterations to obtain a compressed tensor $\mathcal{B}$, with expressivity analysis in Section 3.1.1. In Section 3.2, we describe two algorithms for performing CP on the compressed tensor $\mathcal{B}$ and approximating the original CP factor matrices. Section 3.3 provides additional details about our implementation in Python.
Figure 3: Schematic of the randomized CP decomposition architecture. First, a degree of randomness is employed to derive a smaller tensor $\mathcal{B}$ from the big tensor $\mathcal{X}$. Then, the CP decomposition is performed on $\mathcal{B}$. Finally, the near-optimal high-dimensional factor matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are recovered from the approximate factor matrices $\tilde{\mathbf{A}}$, $\tilde{\mathbf{B}}$, and $\tilde{\mathbf{C}}$.

3.1 Randomized Tensor Algorithm

The aim is to use randomization as a computational resource to efficiently build a suitable basis that captures the action of the tensor $\mathcal{X}$. Assuming an $N$-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, the aim is to obtain a smaller compressed tensor $\mathcal{B} \in \mathbb{R}^{k \times \cdots \times k}$, so that its $N$ tensor modes capture the action of the input tensor modes. Hence, we seek a natural basis in the form of a set of orthonormal matrices $\{\mathbf{Q}_n \in \mathbb{R}^{I_n \times k}\}_{n=1}^{N}$, so that
$$\mathcal{X} \approx \mathcal{X} \times_1 \mathbf{Q}_1\mathbf{Q}_1^\top \times_2 \cdots \times_N \mathbf{Q}_N\mathbf{Q}_N^\top. \quad (6)$$
Here the operator $\times_n$ denotes tensor-matrix multiplication, defined as follows.

Definition 1 The $n$-mode matrix product $\mathcal{X} \times_n \mathbf{Q}_n\mathbf{Q}_n^\top$ multiplies a tensor by the matrices $\mathbf{Q}_n\mathbf{Q}_n^\top$ in mode $n$, i.e., each mode-$n$ fiber is multiplied by $\mathbf{Q}_n\mathbf{Q}_n^\top$:
$$\mathcal{M} = \mathcal{X} \times_n \mathbf{Q}_n\mathbf{Q}_n^\top \quad \Leftrightarrow \quad \mathbf{M}_{(n)} = \mathbf{Q}_n\mathbf{Q}_n^\top \mathbf{X}_{(n)}.$$

Given a fixed target rank $k$, these basis matrices can be efficiently obtained using a randomized algorithm. First, the approximate basis for the $n$-th tensor mode is obtained by drawing $k$ random vectors $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_k \in \mathbb{R}^{\prod_{i \neq n} I_i}$ from a Gaussian distribution. (Note that if $N$ is large, it might be favorable to draw the entries of $\boldsymbol{\omega}$ from a quasi-random sequence, e.g., a Halton or Sobol sequence. These so-called quasi-random numbers are known to be less correlated in high dimensions.)
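The quasi-random alternative mentioned in the note above can be drawn with SciPy's quasi-Monte Carlo module. A minimal sketch follows; the mapping through the Gaussian inverse CDF and all sizes are our own illustrative choices, not prescribed by the text:

```python
import numpy as np
from scipy.stats import qmc, norm

# Draw 16 quasi-random vectors of length d = 512 from a scrambled Sobol
# sequence (a power-of-two sample count preserves its balance properties),
# then map them through the Gaussian inverse CDF to mimic Gaussian entries.
d, k = 512, 16
sampler = qmc.Sobol(d=d, scramble=True, seed=0)
U = sampler.random(k)        # k x d points in [0, 1)
Omega = norm.ppf(U).T        # d x k quasi-random test matrix

print(Omega.shape)           # (512, 16)
```

Here $d$ stands in for $\prod_{i \neq n} I_i$, the product of the non-sampled mode dimensions.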
Stacked together, the $k$ random vectors $\boldsymbol{\omega}$ form the random test matrix $\boldsymbol{\Omega}_n \in \mathbb{R}^{\prod_{i \neq n} I_i \times k}$, used to sample the column space of $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times \prod_{i \neq n} I_i}$ as follows:
$$\mathbf{Y}_n = \mathbf{X}_{(n)}\boldsymbol{\Omega}_n, \quad (7)$$
where $\mathbf{Y}_n \in \mathbb{R}^{I_n \times k}$ is the sketch. The sketch serves as an approximate basis for the range of the $n$-th tensor mode. Probability theory guarantees that the set of random vectors $\{\boldsymbol{\omega}_i\}_{i=1}^{k}$ is linearly independent with high probability. Hence, the corresponding random projections $\mathbf{y}_1, \ldots, \mathbf{y}_k$ efficiently sample the range of a rank-deficient tensor mode $\mathbf{X}_{(n)}$. The economic QR decomposition of the sketch, $\mathbf{Y}_n = \mathbf{Q}_n\mathbf{R}_n$, is then used to obtain a natural basis, so that $\mathbf{Q}_n \in \mathbb{R}^{I_n \times k}$ is orthonormal and has the same column space as $\mathbf{Y}_n$. The final step restricts the tensor mode to this low-dimensional subspace:
$$\mathcal{B}_n = \mathcal{X} \times_n \mathbf{Q}_n^\top \quad \Leftrightarrow \quad \mathbf{B}_{n} = \mathbf{Q}_n^\top\mathbf{X}_{(n)}. \quad (8)$$
Thus, after $N$ iterations, a compressed tensor $\mathcal{B}$ and a set of orthonormal matrices are obtained. Since this is an iterative algorithm, we set $\mathcal{X} \leftarrow \mathcal{B}_n$ after each iteration.

The number of columns of the basis matrices forms a trade-off between accuracy and computational performance. We aim to use as few columns as possible while still allowing an accurate approximation of the input tensor. Assuming that the tensor $\mathcal{X}$ exhibits low-rank structure, or equivalently, that the rank $R$ is much smaller than the ambient dimensions of the tensor, the basis matrices will be an efficient representation. In practice, the compression performance is improved by oversampling, i.e., drawing $l = k + p$ random vectors, where $k$ is the target rank and $p$ the oversampling parameter.

The randomized algorithm as presented requires that the mode-$n$ unfolding of the tensor have a rapidly decaying spectrum in order to achieve good performance. However, this assumption is often not satisfied, and the spectrum exhibits slow decay if the tensor is compressed several times.
To overcome this issue, the algorithm's performance can be substantially improved using power iterations (Rokhlin et al., 2009; Halko et al., 2011; Gu, 2015). Power iterations turn a slowly decaying spectrum into a rapidly decaying one by taking powers of the tensor modes. Thus, instead of sampling $\mathbf{X}_{(n)}$, we sample from the following tensor mode:
$$\mathbf{X}_{(n)}^{q} := (\mathbf{X}_{(n)}\mathbf{X}_{(n)}^\top)^{q}\,\mathbf{X}_{(n)},$$
where $q$ denotes a small integer. This power operation ensures that the singular values of $\mathbf{X}_{(n)}^{q}$ are $\{\sigma_j^{2q+1}\}_j$, where $\sigma_j$ denote the singular values of $\mathbf{X}_{(n)}$. Instead of using (7), an improved sketch is computed:
$$\mathbf{Y}_n = (\mathbf{X}_{(n)}\mathbf{X}_{(n)}^\top)^{q}\,\mathbf{X}_{(n)}\boldsymbol{\Omega}_n. \quad (9)$$
However, if (9) is implemented in this form, the basis may be distorted due to round-off errors. Therefore, in practice, normalized subspace iterations are used to form the sketch, meaning that the sketch is orthonormalized between each power iteration in order to stabilize the algorithm. For implementation details, see Voronin and Martinsson (2015) and Szlam et al. (2014).

The combination of oversampling and additional power iterations can be used to control the trade-off between approximation quality and computational efficiency of the randomized tensor algorithm. Our results, for example, show that just $q = 2$ subspace iterations and an oversampling parameter of about $p = 10$ achieve near-optimal results. Algorithm 1 summarizes the computational steps.

3.1.1 Expressivity Analysis

The average behavior of the randomized tensor algorithm is characterized using the expected residual error
$$\mathbb{E}\,\| \mathcal{E} \|_F, \qquad \mathcal{E} = \mathcal{X} - \hat{\mathcal{X}}, \quad (10)$$
where $\hat{\mathcal{X}} = \mathcal{X} \times_1 \mathbf{Q}_1\mathbf{Q}_1^\top \times_2 \cdots \times_N \mathbf{Q}_N\mathbf{Q}_N^\top$. Theorem 1 generalizes the matrix version, Theorem 10.5 of Halko et al. (2011), to the tensor case.

Theorem 1 (Expected Frobenius error) Consider a low-rank real $N$-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$.
Then the expected approximation error, given a target rank $k \geq 2$ and an oversampling parameter $p \geq 2$ for each mode, is
$$\mathbb{E}\,\Big\| \mathcal{X} - \mathcal{X} \times_1 \mathbf{Q}_1\mathbf{Q}_1^\top \times_2 \cdots \times_N \mathbf{Q}_N\mathbf{Q}_N^\top \Big\|_F \;\leq\; \sqrt{1 + \frac{k}{p-1}} \cdot \sqrt{\sum_{n=1}^{N} \sum_{j > k} \sigma_{nj}^2}, \quad (11)$$
where $\sigma_{nj}$ denotes the $j$-th singular value of the mode-$n$ unfolding of the source tensor $\mathcal{X}$.

For the proof, see Appendix A. Intuitively, the projection of each tensor mode onto a low-dimensional space introduces an additional residual. This is expressed by the double sum on the right-hand side. If the low-rank approximation captures the column space of each mode accurately, then the singular values $\sigma_{nj}$, $j > k$, for each mode $n$ are small. Moreover, the error can be improved using the oversampling parameter. The computation of additional power (subspace) iterations can improve the error further. This result again follows by generalizing the results of Halko et al. (2011) to tensors. Sharper performance bounds for both oversampling and additional power iterations can be derived following, for instance, the results of Witten and Candes (2015).

Algorithm 1 A prototype randomized tensor compression algorithm.
Require: An $N$-way tensor $\mathcal{X}$, and a desired target rank $k$.
Optional: Parameters $p$ and $q$ to control oversampling and power iterations.
(1)  $\mathcal{B} = \mathcal{X}$  ▷ initialize compressed tensor
(2)  for $n = 1, \ldots, N$  ▷ iterate over all tensor modes
(3)    $l = k + p$  ▷ slight oversampling
(4)    $I, J = \mathrm{dim}(\mathbf{B}_{(n)})$  ▷ dimensions of the $n$-th tensor mode
(5)    $\boldsymbol{\Omega} = \mathrm{rand}(J, l)$  ▷ generate random test matrix
(6)    $\mathbf{Y} = \mathbf{B}_{(n)}\boldsymbol{\Omega}$  ▷ compute sampling matrix
(7)    for $j = 1, \ldots, q$  ▷ normalized power iterations (optional)
(8)      $[\mathbf{Q}, \sim] = \mathrm{lu}(\mathbf{Y})$
(9)      $[\mathbf{Z}, \sim] = \mathrm{lu}(\mathbf{B}_{(n)}^\top\mathbf{Q})$
(10)     $\mathbf{Y} = \mathbf{B}_{(n)}\mathbf{Z}$
(11)   end for
(12)   $[\mathbf{Q}_n, \sim] = \mathrm{qr}(\mathbf{Y})$  ▷ orthonormalize sampling matrix
(13)   $\mathcal{B} = \mathcal{B} \times_n \mathbf{Q}_n^\top$  ▷ project tensor to smaller space
(14) end for
Return: Compressed tensor $\mathcal{B}$ of dimension $l \times \cdots \times l$, and a set of orthonormal basis matrices $\{\mathbf{Q}_n \in \mathbb{R}^{I_n \times l}\}_{n=1}^{N}$.

Remark 1 Due to the convenient mathematical properties of the normal distribution, a Gaussian random test matrix is often assumed in theoretical results; in practice, however, a uniformly distributed random test matrix is sufficient. The performance can be further improved using structured random matrices (Woolfe et al., 2008). If information is uniformly distributed across the data, randomly selected columns can also be used to build a suitable basis, which avoids the matrix multiplication in step (6).

Remark 2 For numerical stability, normalized power iterations using the pivoted LU decomposition are computed in steps (7)-(11). We recommend a default value of $q = 2$.

Remark 3 In practice, the user can decide which modes to compress and specify different oversampling parameters for these modes.

3.2 Optimization Strategies

There are several optimization strategies for minimizing the objective function defined in (12), of which we consider alternating least squares (ALS) and block coordinate descent (BCD). Both methods are suitable for operating on a compressed tensor $\mathcal{B} \in \mathbb{R}^{k \times \cdots \times k}$, where $k \geq R$. The optimization problem (5) is reformulated as
$$\underset{\hat{\mathcal{B}}}{\text{minimize}} \;\; \| \mathcal{B} - \hat{\mathcal{B}} \|_F^2 \quad \text{subject to} \quad \hat{\mathcal{B}} = \sum_{r=1}^{R} \tilde{\mathbf{a}}_r \circ \tilde{\mathbf{b}}_r \circ \tilde{\mathbf{c}}_r. \quad (12)$$
Once the compressed factor matrices $\tilde{\mathbf{A}} \in \mathbb{R}^{k \times R}$, $\tilde{\mathbf{B}} \in \mathbb{R}^{k \times R}$, and $\tilde{\mathbf{C}} \in \mathbb{R}^{k \times R}$ are estimated, the full factor matrices can be recovered as
$$\mathbf{A} \approx \mathbf{Q}_1\tilde{\mathbf{A}}, \quad \mathbf{B} \approx \mathbf{Q}_2\tilde{\mathbf{B}}, \quad \mathbf{C} \approx \mathbf{Q}_3\tilde{\mathbf{C}}, \quad (13)$$
where $\mathbf{Q}_1 \in \mathbb{R}^{I \times k}$, $\mathbf{Q}_2 \in \mathbb{R}^{J \times k}$, and $\mathbf{Q}_3 \in \mathbb{R}^{K \times k}$ denote the orthonormal basis matrices. For simplicity, we focus on third-order tensors, but the results generalize to $N$-way tensors.

3.2.1 ALS Algorithm

Due to its simplicity and efficiency, ALS is the most popular method for computing the CP decomposition (Comon et al., 2009; Kolda and Bader, 2009). We note that the optimization (12) is equivalent to
$$\underset{\tilde{\mathbf{A}}, \tilde{\mathbf{B}}, \tilde{\mathbf{C}}}{\text{minimize}} \;\; \Big\| \mathcal{B} - \sum_{r=1}^{R} \tilde{\mathbf{a}}_r \circ \tilde{\mathbf{b}}_r \circ \tilde{\mathbf{c}}_r \Big\|_F^2$$
with respect to the factor matrices $\tilde{\mathbf{A}}$, $\tilde{\mathbf{B}}$, and $\tilde{\mathbf{C}}$. The tensor $\mathcal{B}$ can further be expressed in matricized form:
$$\mathbf{B}_{(1)} \approx \tilde{\mathbf{A}}(\tilde{\mathbf{C}} \odot \tilde{\mathbf{B}})^\top, \quad \mathbf{B}_{(2)} \approx \tilde{\mathbf{B}}(\tilde{\mathbf{C}} \odot \tilde{\mathbf{A}})^\top, \quad \mathbf{B}_{(3)} \approx \tilde{\mathbf{C}}(\tilde{\mathbf{B}} \odot \tilde{\mathbf{A}})^\top,$$
where $\odot$ denotes the Khatri-Rao product. The optimization problem in this form is nonconvex. However, an estimate of the factor matrices can be obtained using the least-squares method as follows. The ALS algorithm updates one component while holding the other two components fixed, in an alternating fashion until convergence. It iterates over the following subproblems:
$$\tilde{\mathbf{A}}^{j+1} = \underset{\tilde{\mathbf{A}}}{\text{argmin}} \; \| \mathbf{B}_{(1)} - \tilde{\mathbf{A}}(\tilde{\mathbf{C}}^{j} \odot \tilde{\mathbf{B}}^{j})^\top \|, \quad (14)$$
$$\tilde{\mathbf{B}}^{j+1} = \underset{\tilde{\mathbf{B}}}{\text{argmin}} \; \| \mathbf{B}_{(2)} - \tilde{\mathbf{B}}(\tilde{\mathbf{C}}^{j} \odot \tilde{\mathbf{A}}^{j+1})^\top \|, \quad (15)$$
$$\tilde{\mathbf{C}}^{j+1} = \underset{\tilde{\mathbf{C}}}{\text{argmin}} \; \| \mathbf{B}_{(3)} - \tilde{\mathbf{C}}(\tilde{\mathbf{B}}^{j+1} \odot \tilde{\mathbf{A}}^{j+1})^\top \|. \quad (16)$$
Each step therefore involves a least-squares problem, which can be solved using the Khatri-Rao product pseudo-inverse. Algorithm 2 summarizes the computational steps.

Definition 2 The Khatri-Rao product pseudo-inverse is defined as
$$(\mathbf{A} \odot \mathbf{B})^\dagger = (\mathbf{A}^\top\mathbf{A} \ast \mathbf{B}^\top\mathbf{B})^\dagger(\mathbf{A} \odot \mathbf{B})^\top,$$
where the operator $\ast$ denotes the Hadamard product, i.e., the elementwise multiplication of two equal-sized matrices.
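The identity in Definition 2 can be checked numerically with SciPy's `khatri_rao`; a minimal sketch with illustrative factor-matrix sizes of our own choosing:

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(1)

# Random factor matrices with illustrative sizes (I = 6, J = 5, R = 3).
A = rng.standard_normal((6, 3))
B = rng.standard_normal((5, 3))

KR = khatri_rao(A, B)   # (I*J) x R column-wise Kronecker (Khatri-Rao) product

# Definition 2: (A kr B)^+ = (A^T A * B^T B)^+ (A kr B)^T, * = Hadamard product.
lhs = np.linalg.pinv(KR)
rhs = np.linalg.pinv((A.T @ A) * (B.T @ B)) @ KR.T

print(np.allclose(lhs, rhs))  # True
```

The identity holds because $(\mathbf{A} \odot \mathbf{B})^\top(\mathbf{A} \odot \mathbf{B}) = \mathbf{A}^\top\mathbf{A} \ast \mathbf{B}^\top\mathbf{B}$, so the pseudo-inverse only requires inverting a small $R \times R$ matrix rather than the tall Khatri-Rao product itself.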
There exist few general convergence guarantees for the ALS algorithm (Uschmajew, 2012; Wang and Chu, 2014). Moreover, the final solution tends to depend on the initial guesses $\tilde{\mathbf{A}}^{0}$, $\tilde{\mathbf{B}}^{0}$, and $\tilde{\mathbf{C}}^{0}$. A standard initial guess uses the eigenvectors of $\mathbf{B}_{(1)}\mathbf{B}_{(1)}^\top$, $\mathbf{B}_{(2)}\mathbf{B}_{(2)}^\top$, and $\mathbf{B}_{(3)}\mathbf{B}_{(3)}^\top$ (Bader and Kolda, 2015). Further, it is important to note that normalization of the factor matrices after each iteration is necessary to achieve good convergence. This prevents singularities of the Khatri-Rao product pseudo-inverse (Kolda and Bader, 2009). The algorithm can be further improved by reformulating the above subproblems as regularized least-squares problems; see, for instance, Li et al. (2013) for technical details and convergence results. The structure imposed by the ALS algorithm on the factor matrices permits the formulation of non-negative or sparsity-constrained tensor decompositions (Cichocki et al., 2009).

Algorithm 2 A prototype randomized CP algorithm using ALS.
Require: An $I \times J \times K$ tensor $\mathcal{X}$, and a desired target rank $R$.
Optional: Parameters $p$ and $q$ to control oversampling and power iterations.
(1)  $\mathcal{B}, \mathbf{Q}_A, \mathbf{Q}_B, \mathbf{Q}_C = \mathrm{compress}(\mathcal{X}, R, p, q)$  ▷ compress tensor using Algorithm 1
(2)  $\mathbf{B}, \mathbf{C} = [\mathrm{eig}(\mathbf{B}_{(2)}\mathbf{B}_{(2)}^\top), \mathrm{eig}(\mathbf{B}_{(3)}\mathbf{B}_{(3)}^\top)]$  ▷ use first $R$ eigenvectors for initialization
(3)  repeat
(4)    $\mathbf{A} = \mathbf{B}_{(1)}(\mathbf{C} \odot \mathbf{B})(\mathbf{C}^\top\mathbf{C} \ast \mathbf{B}^\top\mathbf{B})^\dagger$
(5)    $\mathbf{A} = \mathbf{A} / \mathrm{norm}(\mathbf{A})$
(6)    $\mathbf{B} = \mathbf{B}_{(2)}(\mathbf{C} \odot \mathbf{A})(\mathbf{C}^\top\mathbf{C} \ast \mathbf{A}^\top\mathbf{A})^\dagger$
(7)    $\mathbf{B} = \mathbf{B} / \mathrm{norm}(\mathbf{B})$
(8)    $\mathbf{C} = \mathbf{B}_{(3)}(\mathbf{B} \odot \mathbf{A})(\mathbf{B}^\top\mathbf{B} \ast \mathbf{A}^\top\mathbf{A})^\dagger$
(9)    $\boldsymbol{\lambda} = \mathrm{norm}(\mathbf{C})$
(10)   $\mathbf{C} = \mathbf{C} / \boldsymbol{\lambda}$
(11) until convergence criterion is reached
(12) $\mathbf{A}, \mathbf{B}, \mathbf{C} = [\mathbf{Q}_A\mathbf{A}, \mathbf{Q}_B\mathbf{B}, \mathbf{Q}_C\mathbf{C}]$  ▷ recover factor matrices
(13) re-normalize the factor matrices and update the scaling vector $\boldsymbol{\lambda}$
Return: Normalized factor matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and the scaling vector $\boldsymbol{\lambda}$.
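The core ALS loop, steps (4)-(10) of Algorithm 2, can be sketched in NumPy as follows. This is a simplified illustration that operates on a tensor directly, omitting the compression stage, the convergence test, and the eigenvector initialization; the `unfold` helper, random initialization, and all sizes are our own illustrative constructions:

```python
import numpy as np
from scipy.linalg import khatri_rao

def unfold(T, mode):
    """Mode-n unfolding consistent with X_(n) ~ F (C kr B)^T (Fortran column order)."""
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order="F")

def cp_als(T, R, n_iter=50, seed=0):
    """Plain CP-ALS updates (4)-(10) on a third-order tensor, from random init."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((T.shape[0], R))
    B = rng.standard_normal((T.shape[1], R))
    C = rng.standard_normal((T.shape[2], R))
    for _ in range(n_iter):
        A = unfold(T, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        A /= np.linalg.norm(A, axis=0)           # normalize columns
        B = unfold(T, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        B /= np.linalg.norm(B, axis=0)
        C = unfold(T, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        lam = np.linalg.norm(C, axis=0)          # absorb scaling into lambda
        C /= lam
    return lam, A, B, C

# Usage on a synthetic rank-3 tensor (illustrative):
rng = np.random.default_rng(2)
A0, B0, C0 = (rng.standard_normal((d, 3)) for d in (10, 9, 8))
T = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
lam, A, B, C = cp_als(T, R=3)
T_hat = np.einsum("ir,jr,kr,r->ijk", A, B, C, lam)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))
```

In the randomized setting, the same loop would run on the small compressed tensor $\mathcal{B}$, followed by the back-projection in step (12).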
3.2.2 BCD Algorithm

While ALS is the most popular algorithm for computing the CP decomposition, many alternative algorithms have been developed. One alternative approach is based on block coordinate descent (BCD) (Xu and Yin, 2013). Cichocki and Phan (2009) first proposed this approach for computing nonnegative tensor factorizations. The BCD algorithm is based on the idea of successive rank-one deflation. Unlike ALS, which updates the entire factor matrix at each step, BCD computes the rank-one tensors in a hierarchical fashion. The algorithm treats each component $\mathbf{a}_r$, $\mathbf{b}_r$, $\mathbf{c}_r$ as a block. First, the most correlated rank-one tensor is computed; then the second most correlated rank-one tensor is learned on the residual tensor, and so on. Assuming that $\tilde{R} = r - 1$ components have been computed, the $r$-th compressed residual tensor $\mathcal{Y}^{\mathrm{res}}$ is defined as
$$\mathcal{Y}^{\mathrm{res}} = \mathcal{B} - \sum_{r=1}^{\tilde{R}} \tilde{\mathbf{a}}_r \circ \tilde{\mathbf{b}}_r \circ \tilde{\mathbf{c}}_r. \quad (17)$$
The algorithm then iterates over the following subproblems:
$$\tilde{\mathbf{a}}_r^{j+1} = \underset{\tilde{\mathbf{a}}_r}{\text{argmin}} \; \| \mathbf{Y}^{\mathrm{res}}_{(1)} - \tilde{\mathbf{a}}_r(\tilde{\mathbf{c}}_r^{j} \odot \tilde{\mathbf{b}}_r^{j})^\top \|, \quad (18)$$
$$\tilde{\mathbf{b}}_r^{j+1} = \underset{\tilde{\mathbf{b}}_r}{\text{argmin}} \; \| \mathbf{Y}^{\mathrm{res}}_{(2)} - \tilde{\mathbf{b}}_r(\tilde{\mathbf{c}}_r^{j} \odot \tilde{\mathbf{a}}_r^{j+1})^\top \|, \quad (19)$$
$$\tilde{\mathbf{c}}_r^{j+1} = \underset{\tilde{\mathbf{c}}_r}{\text{argmin}} \; \| \mathbf{Y}^{\mathrm{res}}_{(3)} - \tilde{\mathbf{c}}_r(\tilde{\mathbf{b}}_r^{j+1} \odot \tilde{\mathbf{a}}_r^{j+1})^\top \|. \quad (20)$$
Note that the computation can be evaluated more efficiently without explicitly constructing the residual tensor $\mathcal{Y}^{\mathrm{res}}$ (Kim et al., 2014). Algorithm 3 summarizes the computation.

Algorithm 3 A prototype randomized CP algorithm using BCD.
Require: An $I \times J \times K$ tensor $\mathcal{X}$, and a desired target rank $R$.
Optional: Parameters $p$ and $q$ to control oversampling and power iterations.
(1)  $\mathcal{B}, \mathbf{Q}_A, \mathbf{Q}_B, \mathbf{Q}_C = \mathrm{compress}(\mathcal{X}, R, p, q)$  ▷ compress tensor using Algorithm 1
(2)  $\mathbf{B}, \mathbf{C} = [\mathrm{eig}(\mathbf{B}_{(2)}\mathbf{B}_{(2)}^\top), \mathrm{eig}(\mathbf{B}_{(3)}\mathbf{B}_{(3)}^\top)]$  ▷ use first $R$ eigenvectors for initialization
(3)  $\mathcal{Y} = \mathcal{B}$  ▷ initialize residual tensor
(4)  for $r = 1, \ldots, R$  ▷ compute rank-$r$ approximation
(5)    repeat
(6)      $\mathbf{a}_r = \mathbf{Y}_{(1)}(\mathbf{c}_r \odot \mathbf{b}_r)(\mathbf{c}_r^\top\mathbf{c}_r \ast \mathbf{b}_r^\top\mathbf{b}_r)^\dagger$
(7)      $\mathbf{a}_r = \mathbf{a}_r / \mathrm{norm}(\mathbf{a}_r)$
(8)      $\mathbf{b}_r = \mathbf{Y}_{(2)}(\mathbf{c}_r \odot \mathbf{a}_r)(\mathbf{c}_r^\top\mathbf{c}_r \ast \mathbf{a}_r^\top\mathbf{a}_r)^\dagger$
(9)      $\mathbf{b}_r = \mathbf{b}_r / \mathrm{norm}(\mathbf{b}_r)$
(10)     $\mathbf{c}_r = \mathbf{Y}_{(3)}(\mathbf{b}_r \odot \mathbf{a}_r)(\mathbf{b}_r^\top\mathbf{b}_r \ast \mathbf{a}_r^\top\mathbf{a}_r)^\dagger$
(11)     $\lambda_r = \mathrm{norm}(\mathbf{c}_r)$
(12)     $\mathbf{c}_r = \mathbf{c}_r / \lambda_r$
(13)   until convergence criterion is reached
(14)   $\mathcal{Y} = \mathcal{B} - [\![ \boldsymbol{\lambda}_{[1:r]}; \mathbf{A}_{[:,1:r]}, \mathbf{B}_{[:,1:r]}, \mathbf{C}_{[:,1:r]} ]\!]$  ▷ update residual tensor
(15) end for
(16) $\mathbf{A}, \mathbf{B}, \mathbf{C} = [\mathbf{Q}_A\mathbf{A}, \mathbf{Q}_B\mathbf{B}, \mathbf{Q}_C\mathbf{C}]$  ▷ recover factor matrices
(17) re-normalize the factor matrices and update the scaling vector $\boldsymbol{\lambda}$
Return: Normalized factor matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and the scaling vector $\boldsymbol{\lambda}$.

3.3 Implementation Details

The algorithms we present are implemented in the programming language Python, using numerical linear algebra tools provided by the SciPy (Open Source Library of Scientific Tools) package (Jones et al., 2001). SciPy provides MKL (Math Kernel Library) accelerated, high-performance implementations of BLAS and LAPACK routines. Thus, all linear algebra operations are threaded and highly optimized on Intel processors. The implementation of the CP decomposition follows the MATLAB Tensor Toolbox implementation (Bader and Kolda, 2015). This implementation normalizes the components after each step to achieve better convergence. Furthermore, we use eigenvectors (see above) to initialize the factor matrices.
Interestingly, randomly initialized factor matrices can achieve slightly better approximation errors, but re-running the algorithms several times with different random seeds can display significant variance in the results. Hence, only the eigenvector-based initialization is used. We note that the randomized algorithm introduces some randomness and slight variations into the CP decompositions as well. However, randomization can also act as an implicit regularization on the CP decomposition (Mahoney, 2011), meaning that the results of the randomized algorithm can in some cases even be 'better' than the results of the corresponding deterministic implementation. Finally, the convergence criterion is defined as the change in fit, following Bader and Kolda (2015). The algorithm therefore stops when the improvement of the fit \rho is less than a predefined threshold, where the fit is computed as

    \rho = 1 - \left( \|\mathcal{X}\|_F^2 + \|\hat{\mathcal{X}}\|_F^2 - 2 \langle \hat{\mathcal{X}}, \mathcal{X} \rangle \right) / \|\mathcal{X}\|_F^2.

4. Numerical Results

The randomized CP algorithm is evaluated on a number of examples where the near-optimal approximation of massive tensors can be achieved in a fraction of the time. Approximation accuracy is computed with the relative error \|\mathcal{X} - \hat{\mathcal{X}}\|_F / \|\mathcal{X}\|_F, where \hat{\mathcal{X}} denotes the approximated tensor.

4.1 Computational Performance

The robustness of the randomized CP algorithm is first assessed on random low-rank tensors. Here we illustrate the approximation quality in the presence of additive white noise. Figure 4 shows the average relative errors over 100 runs for varying signal-to-noise ratios (SNR). In the case of high SNRs, all algorithms converge towards the same relative error.
However, at excessive levels of noise (i.e., SNR < 4), the deterministic CP algorithms exhibit small gains in accuracy over the randomized algorithms using q = 2 power iterations, with which both the ALS and BCD algorithms show the same performance. The performance of the randomized algorithm without power iterations (q = 0) is, however, poor, and stresses the importance of the power operation for real applications. The oversampling parameter for the randomized algorithms is set to p = 10. Increasing p can further improve accuracy.

Figure 4: Average relative error, plotted on a log scale, against increasing signal-to-noise ratio. The analysis is performed on a rank R = 50 tensor of dimension 100 × 100 × 100. Power iterations improve the approximation accuracy considerably, while the performance without power iterations is poor.

Next, the reconstruction errors and runtimes for tensors of varying dimensions are compared. Figure 5 shows the average evaluation results over 100 runs for random low-rank tensors of different dimensions, and for varying target ranks k. The randomized algorithms achieve near-optimal approximation accuracy while demonstrating substantial computational savings. Interestingly, the relative error achieved by the BCD decreases sharply by about one order of magnitude when the target rank k matches the actual tensor rank (here R = 50).

The computational advantage becomes more pronounced with increasing tensor dimensions, and as the number of iterations required for convergence increases. Using random tensors as presented here, all algorithms rapidly converge after about 4 to 6 iterations. However, it is evident that the computational cost per iteration of the randomized algorithm is substantially lower.
Thus, the computational savings can be even more substantial in real-world applications that may require several hundred iterations to converge.

Figure 5: Random tensor approximation and performance for rank R = 50 tensors: (a) tensor of dimension 100 × 100 × 100; (b) tensor of dimension 400 × 400 × 400; (c) tensor of dimension 100 × 100 × 100 × 100. rCP methods achieve speedups by 1-2 orders of magnitude and the same accuracy as their deterministic counterparts.

Overall, the ALS algorithm is computationally more efficient than BCD, i.e., the deterministic ALS algorithm is faster than the BCD by nearly one order of magnitude, while the randomized algorithms exhibit similar computational timings. Similar performance results are achieved for higher-order tensors. Figure 5c shows the computational performance for a 4-way tensor of dimension 100 × 100 × 100 × 100. Again, the randomized algorithms achieve speedups of 1-2 orders of magnitude, while attaining good approximation errors.

4.2 Examples

The examples demonstrate the advantages of the randomized CP decomposition. The first is a multiscale toy video example, and the second is a simulated flow field behind a stationary cylinder. Due to the better and more natural interpretability of the BCD algorithm, only this algorithm is considered in the subsequent sections.

4.2.1 Multiscale Toy Video Example

The approximation of the underlying spatial modes and temporal dynamics of a system is a common problem in signal processing. In the following, we consider a toy example presenting multiscale intermittent dynamics in the time direction. The data consist of four Gaussian modes, each undergoing a different frequency of intermittent oscillation, for 215 time steps in the temporal direction on a two-dimensional spatial grid (200 × 200). The resulting tensor is of dimension 200 × 200 × 215.
Figure 6 shows the corresponding modes and the time dynamics. This problem becomes even more challenging when the underlying structure needs to be reconstructed from noisy measurements.

Figure 6: Illustration of the multiscale toy video. The system is governed by four spatial modes experiencing intermittent oscillations in the temporal direction. The top row shows the clean system, while the bottom row shows the noisy system with a signal-to-noise ratio of 2.

Traditional matrix decomposition techniques such as the SVD are known to face difficulties capturing such intermittent, multiscale structure. Figure 7 displays the decomposition results of the noisy (SNR = 2) toy video for both the randomized CP decomposition and the SVD. The first subplot shows the results of a rank k = 4 approximation computed using the rCP algorithm with q = 2 power iterations and a small oversampling parameter p = 10. The method faithfully captures the underlying spatial modes and the time dynamics. For illustration, the second subplot shows the decomposition results without additional power iterations. It can clearly be seen that this approach introduces distinct artifacts, and the approximation quality is relatively poor. The SVD, shown in the last subplot, demonstrates poor performance at separating the modes and mixes the spatiotemporal dynamics of modes 2 and 3.

Figure 7: Toy video decomposition results: (a) rCP (q = 2); (b) rCP (q = 0); (c) SVD. The randomized CP with power iterations (q = 2) accurately reconstructs the original spatiotemporal dynamics from noise-corrupted data, while the SVD and rCP without power iterations (q = 0) yield poor reconstruction results.

Table 1 further quantifies the observed results.
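The reported errors are relative Frobenius errors of the CP reconstruction. Given weights λ and factor matrices A, B, C (hypothetical placeholders here, not the paper's actual outputs), the reconstruction and its error can be computed as:

```python
import numpy as np

def cp_reconstruct(lam, A, B, C):
    """Rebuild a 3-way tensor from CP factors: sum_r lam_r * a_r o b_r o c_r."""
    return np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

def relative_error(X, X_hat):
    """Relative Frobenius error ||X - X_hat||_F / ||X||_F."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Illustrative check on a synthetic rank-4 tensor: an exact CP model has zero error.
rng = np.random.default_rng(0)
lam = rng.uniform(1, 2, size=4)
A, B, C = (rng.standard_normal((d, 4)) for d in (200, 200, 215))
X = cp_reconstruct(lam, A, B, C)
print(relative_error(X, cp_reconstruct(lam, A, B, C)))  # → 0.0
```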
Interestingly, the relative error using the randomized algorithm with q = 2 power iterations is slightly better than that of the deterministic algorithm. This is due to the intrinsic regularizing behavior of randomization. However, the reconstruction error without power iterations is large, as is the error resulting from the SVD. The achieved speedup of the randomized CP is substantial, with a speedup factor of about 15.

             Parameters             Time (s)   Speedup   Iterations   Error
CP BCD       k = 4                  2.31       -         9            0.0171
rCP BCD      k = 4, p = 10, q = 0   0.13       17        9            0.494
             k = 4, p = 10, q = 1   0.14       16        10           0.0191
             k = 4, p = 10, q = 2   0.15       15        10           0.0164
SVD          k = 4                  0.52       -         -            0.137

Table 1: Summary of the computational results for the noisy toy video.

A Note on Compression. The CP decomposition provides a more parsimonious representation of the data. Comparing the compression ratios between the CP decomposition and the SVD illustrates the difference. For a rank R = 4 tensor of dimension 100 × 100 × 100, the compression ratios are

    c_{SVD} = \frac{I \cdot J \cdot K}{R \cdot (I \cdot J + K + 1)} = \frac{100^3}{4 \cdot (100^2 + 100 + 1)} \approx 24.75,

    c_{CP} = \frac{I \cdot J \cdot K}{R \cdot (I + J + K + 1)} = \frac{100^3}{4 \cdot (100 + 100 + 100 + 1)} \approx 830.56.

Note that the SVD requires the tensor to be reshaped in some direction. The comparison illustrates the striking difference between the compression ratios. It is evident that the CP decomposition requires computing many fewer coefficients in order to approximate the tensor. Thus, less storage is required to approximate the data. While the CP decomposition yields a parsimonious approximation that is more robust to noise, the advantage of the SVD is that data can be approximated with an accuracy as low as machine precision.

4.2.2 Flow Behind a Cylinder

Extracting the dominant coherent structures from fluid flows helps to better characterize them for modeling and control (Brunton and Noack, 2015).
The workhorse algorithm in fluid dynamics and for model reduction is the SVD. However, fluid simulations generate high-resolution spatiotemporal grids of data which naturally manifest as tensors. In the following, we examine the suitability of the CP decomposition for decomposing flow data, and compare the results to those of the SVD. The fluid flow behind a cylinder, a canonical example in fluid dynamics, is presented here. The data are a time series generated from a simulation of fluid vorticity behind a stationary cylinder (Colonius and Taira, 2008). The corresponding flow tensor has dimension 199 × 449 × 151, consisting of 151 snapshots on a 449 × 199 spatial grid. Figure 8 shows a few example snapshots of the fluid flow.

Figure 8: Snapshots of the fluid flow behind a cylinder.

The flow is characterized by a periodically shedding wake structure and is inherently low-rank in the absence of noise. The characteristic frequencies of the flow oscillations occur in pairs, reflecting the complex-conjugate pairs of eigenvalues corresponding to harmonics in the temporal direction.

Figure 9: Fluid flow decomposition, no noise: (a) randomized CP (q = 2); (b) SVD. Both methods capture the same dominant frequencies from the time dynamics, while randomized CP requires more rank-1 outer products in x and y to represent single-frequency spatial dynamics.

Results in Absence of White Noise. Figure 9 shows both the approximated spatial modes and the temporal dynamics for the randomized CP decomposition and the SVD. Observe that both methods extract similar patterns, although, unlike the SVD, the rCP spatial modes are sparse on the domain. The reason is two-fold: (i) CP does not impose orthogonality constraints (which in general reveal dense structure), and (ii) CP imposes rank-one outer product structure in the x and y directions via the columns of the factor matrices.
In doing so, CP isolates the contributions of single wavenumbers (spatial frequencies) to the steady-state vortex shedding regime. These are commonly analyzed to study energy transfer between scales in complex flows and turbulence (Meneveau and Katz, 2000; Sharma and McKeon, 2013); hence, CP can help reveal new physically meaningful insights. In particular, randomization enables tensor decomposition of high-resolution vector fields that otherwise may not be tractable using deterministic methods. Thus, rCP provides a novel decomposition of multimodal fields for downstream tasks in model reduction, pattern extraction, and control.

             Parameters              Time (s)   Speedup   Iterations   Error
CP BCD       k = 30                  115.55     -         458          0.117
rCP BCD      k = 30, p = 10, q = 0   1.27       91        533          0.122
             k = 30, p = 10, q = 1   1.41       82        517          0.121
             k = 30, p = 10, q = 2   1.56       74        437          0.118
SVD          k = 30                  0.57       -         -            4.25E-05

Table 2: Summary of the computational results for the noise-free cylinder flow.

Note that for a fixed target rank of k = 30 across all methods, the SVD achieves a substantially lower reconstruction error (see Table 2). However, the compression ratios for the CP and SVD methods are c_{CP} ≈ 562.17 and c_{SVD} ≈ 5.02, i.e., the CP compresses the data by nearly two orders of magnitude more.

Figure 10: Fluid flow decomposition, noisy (SNR = 2): (a) randomized CP (q = 2); (b) SVD. Randomized CP modes are robust to additive noise, while SVD spatial modes are corrupted by noise.

Results in Presence of White Noise. Next, the analysis of the same flow is repeated in the presence of additive white noise. While this is not of concern when dealing with flow simulations, it is realistic when dealing with flows obtained from measurements. We chose a signal-to-noise ratio of 2 to demonstrate the robustness of the CP decomposition to noise.
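Scaling additive white noise to a prescribed SNR can be sketched as follows; here the SNR is taken to be the ratio of signal variance to noise variance, which is an assumption about the paper's convention:

```python
import numpy as np

def add_white_noise(X, snr, rng=None):
    """Add Gaussian white noise scaled so that var(signal) / var(noise) = snr."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(X.shape)
    noise *= np.sqrt(X.var() / (snr * noise.var()))  # rescale to target SNR
    return X + noise

# Example: corrupt a random low-rank tensor of the cylinder-flow dimensions at SNR = 2.
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((d, 4)) for d in (199, 449, 151))
X = np.einsum('ir,jr,kr->ijk', A, B, C)
X_noisy = add_white_noise(X, snr=2, rng=2)
print(round(X.var() / (X_noisy - X).var(), 2))  # → 2.0
```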
Figure 10 shows the corresponding dominant spatial modes and temporal dynamics. Both the SVD and the CP decomposition faithfully capture the temporal dynamics. However, comparing the modes of the SVD to Figure 9, it is apparent that the spatial modes contain a large amount of noise. The spatial modes revealed by the CP decomposition provide a significantly better approximation. Again, it is crucial to use power iterations to achieve a good approximation quality (see Table 3). By inspection, the relative reconstruction error using the SVD is poor compared to the error achieved using the CP decomposition. Here, we show the error for a rank k = 30 and a rank k = 6 approximation. The target rank was determined using the optimal hard threshold for singular values (Gavish and Donoho, 2014).

             Parameters              Time (s)   Speedup   Iterations   Error
CP BCD       k = 30                  64.01      -         239          0.191
rCP BCD      k = 30, p = 10, q = 0   0.99       64        332          0.522
             k = 30, p = 10, q = 1   1.23       52        414          0.189
             k = 30, p = 10, q = 2   1.13       56        370          0.153
SVD          k = 30                  0.58       -         -            0.655
             k = 6                   0.58       -         -            0.311

Table 3: Summary of the computational results for the noise-corrupted cylinder flow.

The CP decomposition overcomes this disadvantage, and is able to approximate the first k = 30 modes with only a slight loss of accuracy. Note that here the randomized CP decomposition performs better than the deterministic algorithm. We assume that this is due to the favorable intrinsic regularization effect of randomized methods.

5. Conclusion

The emergence of massive tensors requires efficient algorithms for obtaining tensor decompositions. To address this challenge, we have presented a randomized algorithm which substantially reduces the computational demands of the CP decomposition. Indeed, randomized algorithms have established themselves as highly competitive methods for computing traditional matrix decompositions.
A key advantage of the randomized algorithm is that modern computational architectures are fully exploited. Thus, the algorithm benefits substantially from multithreading on a multi-core processor. In contrast to previously proposed high-performance tensor algorithms which are based on computational concepts such as distributed computing, our proposed randomized algorithm provides substantial computational speedups even on standard desktop computers. Our proposed algorithm achieves these speedups by reducing the computational cost per iteration, which enables the user to decompose real-world examples that typically require a large number of iterations to converge. In addition to computational savings, the randomized CP decomposition demonstrates outstanding performance on several examples using artificial and real-world data, including decompositions of high-resolution flow fields that may not be tractable with deterministic methods. Moreover, our experiments show that the power iteration concept is crucial in order to achieve a robust tensor decomposition. Thus, our algorithm has a practical advantage over previous randomized tensor algorithms, at a slightly increased computational cost due to the additional power iterations.

Acknowledgments

We would like to thank Alex Williams and Michael W. Mahoney for insightful discussions on tensor decompositions and randomized numerical linear algebra. Further, we would like to express our gratitude to the two anonymous reviewers for their valuable feedback, which helped us greatly improve the manuscript. NBE would like to acknowledge DARPA and NSF for providing partial support of this work. KM acknowledges support from NSF MSPRF Award 1803663. JNK acknowledges support from the Air Force Office of Scientific Research (AFOSR) grant FA9550-17-1-0329.
SLB acknowledges funding support from the Air Force Office of Scientific Research (AFOSR) grant FA9550-18-1-0200.

Data Availability Statement

The data that support the findings of this study are available upon request.

Appendix A. Proof of Theorem 1

In the following, we sketch a proof of Theorem 1, which yields an upper bound for the approximate basis for the range of a tensor. To assess the quality of the basis matrices \{Q_n\}_{n=1}^N, we first show that the problem can be expressed as a sum of subproblems. We define the residual error

    \|\mathcal{E}\|_F = \|\mathcal{X} - \hat{\mathcal{X}}\|_F = \|\mathcal{X} - \mathcal{X} \times_1 Q_1 Q_1^\top \times_2 \cdots \times_N Q_N Q_N^\top\|_F.    (21)

Note that the Frobenius norm of a tensor and its matricized forms are equivalent. Defining the orthogonal projector P_n \equiv Q_n Q_n^\top, we can reformulate (21) compactly as

    \|\mathcal{E}\|_F = \|\mathcal{X} - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_N P_N\|_F.    (22)

Proof. Assuming that P_n yields an exact projection onto the column space of the matrix Q_n, we first need to show that the error can be bounded by the sum of the errors of the N projections

    \|\mathcal{E}\|_F \le \sum_{n=1}^{N} \|\mathcal{X} - \mathcal{X} \times_n P_n\|_F = \sum_{n=1}^{N} \|\mathcal{X} \times_n (I - P_n)\|_F,    (23)

where I denotes the identity matrix. Following Drineas and Mahoney (2007), we add and subtract the term \mathcal{X} \times_N P_N in Equation (22) so that we obtain

    \|\mathcal{E}\|_F = \|\mathcal{X} - \mathcal{X} \times_N P_N + \mathcal{X} \times_N P_N - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_N P_N\|_F    (24)
                   \le \|\mathcal{X} - \mathcal{X} \times_N P_N\|_F + \|\mathcal{X} \times_N P_N - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_N P_N\|_F    (25)
                     = \|\mathcal{X} - \mathcal{X} \times_N P_N\|_F + \|(\mathcal{X} - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_{N-1} P_{N-1}) \times_N P_N\|_F    (26)
                   \le \|\mathcal{X} - \mathcal{X} \times_N P_N\|_F + \|\mathcal{X} - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_{N-1} P_{N-1}\|_F.    (27)

The bound (25) follows from the triangle inequality for a norm. Next, the common term P_N is factored out in Equation (26). Then, the bound (27) follows from the properties of orthogonal projectors.
This is because range(\mathcal{X} \times_1 P_1 \times_2 \cdots \times_{N-1} P_{N-1}) \subset range(\mathcal{X} \times_1 P_1 \times_2 \cdots \times_N P_N), and then it holds that \|\mathcal{X} - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_N P_N\|_F \le \|\mathcal{X} - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_{N-1} P_{N-1}\|_F. See Proposition 8.5 by Halko et al. (2011) for a proof using matrices. Subsequently, the residual error \mathcal{E}_{N-1} can be bounded as

    \|\mathcal{E}_{N-1}\|_F \le \|\mathcal{X} - \mathcal{X} \times_{N-1} P_{N-1}\|_F + \|\mathcal{X} - \mathcal{X} \times_1 P_1 \times_2 \cdots \times_{N-2} P_{N-2}\|_F.    (28)

From this inequality, Equation (23) follows. We take the expectation of Equation (23):

    \mathbb{E}\|\mathcal{E}\|_F \le \mathbb{E}\left[ \sum_{n=1}^{N} \|\mathcal{X} \times_n (I - P_n)\|_F \right].    (29)

Recall that Theorem 10.5 formulated by Halko et al. (2011) states the following expected approximation error (formulated here in tensor notation)

    \mathbb{E}\|\mathcal{X} \times_n (I - P_n)\|_F \le \sqrt{1 + \frac{k}{p - 1}} \cdot \sqrt{\sum_{j > k} \sigma_j^2},    (30)

assuming that the sketch in Equation (7) is constructed using a standard Gaussian matrix \Omega. Here, \sigma_j denotes the j-th singular value of the matricized tensor X_{(n)}, for indices j greater than the chosen target rank k. Combining Equations (29) and (30) then yields the result of the theorem (11).

Figure 11 evaluates the theoretical upper bound over 100 runs for both a third- and a fourth-order random low-rank (R = 25) tensor. Here, we use a fixed oversampling parameter p = 2. The results show that the empirical error is faithfully bounded by the theoretical upper bound for varying target ranks.

Figure 11: Empirical evaluation of the theoretical upper bound: (a) tensor of dimension 50 × 50 × 50; (b) tensor of dimension 50 × 50 × 50 × 50.

References

D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM, 54:9, 2007.
B. W. Bader and T. G. Kolda. MATLAB Tensor Toolbox version 2.6, 2015. URL http://www.sandia.gov/~tgkolda/TensorToolbox/.
C. Battaglino, G. Ballard, and T. G. Kolda. A practical randomized CP tensor decomposition.
SIAM Journal on Matrix Analysis and Applications, 39(2):876–901, 2018.
S. L. Brunton and B. R. Noack. Closed-loop turbulence control: Progress and challenges. Applied Mechanics Reviews, 67:1–60, 2015.
J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of 'Eckart-Young' decomposition. Psychometrika, 35:283–319, 1970.
A. Cichocki and A. H. Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 92:708–721, 2009.
A. Cichocki, R. Zdunek, A. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons, 2009.
T. Colonius and K. Taira. A fast immersed boundary method using a nullspace approach and multi-domain far-field boundary conditions. Computer Methods in Applied Mechanics and Engineering, 197:2131–2146, 2008.
P. Comon, X. Luciani, and A. De Almeida. Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics, 23:393–405, 2009.
P. Drineas and M. W. Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and its Applications, 420:553–571, 2007.
P. Drineas and M. W. Mahoney. RandNLA: Randomized numerical linear algebra. Communications of the ACM, 59:80–90, 2016.
N. B. Erichson, S. Voronin, S. L. Brunton, and J. N. Kutz. Randomized matrix decompositions using R. arXiv preprint arXiv:1608.02148, 2016.
A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM, 51:1025–1041, 2004.
M. Gavish and D. L. Donoho. The optimal hard threshold for singular values is 4/√3.
IEEE Transactions on Information Theory, 60(8):5040–5053, 2014.
M. Gu. Subspace iteration randomization and singular value problems. SIAM Journal on Scientific Computing, 37(3):1139–1173, 2015. doi: 10.1137/130938700.
N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011. doi: 10.1137/090771806.
R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. Technical Report 16, Working Papers in Phonetics, UCLA, 1970.
F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematical Physics, 6:164–189, 1927.
D. Hong, T. G. Kolda, and J. A. Duersch. Generalized canonical polyadic tensor decomposition. SIAM Review, 62(1):133–163, 2020. doi: 10.1137/18M1203626. URL https://doi.org/10.1137/18M1203626.
E. Jones, T. Oliphant, and P. Peterson. SciPy: Open source scientific tools for Python, 2001. URL https://www.scipy.org/.
J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58:285–319, 2014.
T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51:455–500, 2009.
N. Li, S. Kindermann, and C. Navasca. Some convergence results on the regularized alternating least-squares method for tensor decomposition. Linear Algebra and its Applications, 438:796–812, 2013.
E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104:20167–20172, 2007.
M. W. Mahoney. Randomized algorithms for matrices and data.
Foundations and Trends in Machine Learning, 3:123–224, 2011.
P.-G. Martinsson. Randomized methods for matrix computations and analysis of high dimensional data. arXiv preprint arXiv:1607.01649, 2016.
P.-G. Martinsson, V. Rokhlin, and M. Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis, 30:47–68, 2011.
C. Meneveau and J. Katz. Scale-invariance and turbulence models for large-eddy simulation. Annual Review of Fluid Mechanics, 32(1):1–32, 2000.
A. Phan and A. Cichocki. PARAFAC algorithms for large-scale problems. Neurocomputing, 74:1970–1984, 2011.
V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31:1100–1124, 2009.
A. Sharma and B. McKeon. On coherent structure in wall turbulence. Journal of Fluid Mechanics, 728:196–238, 2013.
N. Sidiropoulos, E. Papalexakis, and C. Faloutsos. Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition. IEEE Signal Processing Magazine, 31:57, 2014.
A. Szlam, Y. Kluger, and M. Tygert. An implementation of a randomized algorithm for principal component analysis. arXiv preprint arXiv:1412.3510, 2014.
C. E. Tsourakakis. MACH: Fast randomized tensor decompositions. In Proc. 2010 SIAM Int. Conf. Data Mining, pages 689–700, 2010.
A. Uschmajew. Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM Journal on Matrix Analysis and Applications, 33:639–652, 2012.
N. Vervliet and L. De Lathauwer. A randomized block sampling approach to canonical polyadic decomposition of large-scale tensors. IEEE Journal of Selected Topics in Signal Processing, 10:284–295, 2016.
N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer.
Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis. IEEE Signal Processing Magazine, 31:71–79, 2014.
S. Voronin and P.-G. Martinsson. RSVDPACK: Subroutines for computing partial singular value decompositions via randomized sampling on single core, multi core, and GPU architectures. arXiv preprint arXiv:1502.05366, 2015.
L. Wang and M. T. Chu. On the global convergence of the alternating least squares method for rank-one approximation to generic tensors. SIAM Journal on Matrix Analysis and Applications, 35:1058–1072, 2014.
R. Witten and E. Candes. Randomized algorithms for low-rank matrix factorizations: Sharp performance bounds. Algorithmica, 72:264–281, 2015.
F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Journal of Applied and Computational Harmonic Analysis, 25:335–366, 2008.
Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
G. Zhou, A. Cichocki, and S. Xie. Decomposition of big tensors with low multilinear rank. arXiv preprint arXiv:1412.1885, 2014.