A GPU-Oriented Algorithm Design for Secant-Based Dimensionality Reduction

A GPU-Oriented Algorithm Design for Secant-Based Dimensionality Reduction Henry Kvinge, Elin Farnell, Michael Kirby , and Chris Peterson Department of Mathematics Colorado State Univ ersity Fort Collins, CO 80523-1874 c  2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Abstract —Dimensionality-reduction techniques are a funda- mental tool for extracting useful inf ormation from high- dimensional data sets. Because secant sets encode manifold geometry , they are a useful tool for designing meaningful data- reduction algorithms. In one such approach, the goal is to construct a projection that maximally a voids secant dir ections and hence ensur es that distinct data points ar e not mapped too close together in the reduced space. This type of algorithm is based on a mathematical framework inspired by the constructive proof of Whitney’ s embedding theorem from differential topol- ogy . Computing all (unit) secants for a set of points is by nature computationally expensiv e, thus opening the door for exploitation of GPU architecture for achieving fast versions of these algo- rithms. W e present a polynomial-time data-reduction algorithm that produces a meaningful low-dimensional representation of a data set by iteratively constructing improved projections within the framework described above. Key to our algorithm design and implementation is the use of GPUs which, among other things, minimizes the computational time requir ed f or the calculation of all secant lines. One goal of this report is to shar e ideas with GPU experts and to discuss a class of mathematical algorithms that may be of interest to the broader GPU community . I . I N T R O D U C T I O N High performance computing architectures and massiv e data centers have created a modern data challenge. Supercomputers generate data at ever more amazing rates, in extreme cases upwards of 20,000,000 cores processing power and processing speeds approaching 100 petaﬂops. Also, advances in data acquisition have us in uncharted territory; e.g., the Australian Square Kilometre Array Pathﬁnder is generating data mea- sured in petabytes per hour . How does one begin to interrogate data sets of this magni- tude? In this paper we propose that one of the ﬁrst questions one can pose about data is mathematical, i.e., what is the dimension of the data? Are fast dimension estimates of mas- siv e data sets feasible? What is the role of high-performance computing and parallel architectures for dimension-estimation algorithms? Examining all points, or large subsets of points, to identify optimal representation subspaces appears inherently parallelizable. The dimension question provides a window into the complexity of the data and opportunities for data reduction. Even the question of monitoring these high-performance computing systems is a challenge. Mathematical models for anomaly detection provide an interesting direction, b ut these approaches also suffer from the curse of dimensionality [1], [2]. Characterizing the behavior of a 100,000,000 variable dynamical system requires new modeling strategies. Again, understanding the dimension of the data may provide the ﬁrst step to making the modeling procedure tractable [3]. The outline of this paper is as follo ws: in Section II, we discuss our proposal for an algorithm that provides both a meaningful projection for dimensionality reduction and a means of estimating the dimension of a data set. W e cover foundational mathematical background in Section III. W e present the speciﬁcs of the proposed algorithm in Section IV and we demonstrate its use on a synthetic example in Section V. In Section VI, we use the algorithm to estimate the dimension of three data sets: a synthetic data set, a data set of digital images, and a hyperspectral data set. Finally , in Section VII we suggest a modiﬁed version of the algorithm for noisy data. I I . O V E RV I E W High-dimensional data sets can be both a computational burden and dif ﬁcult to analyze. Data-reduction algorithms offer a way to reduce these dif ﬁculties by mapping the data set into a lower -dimensional space with the goal of retaining as much information as possible. A classical example of such an algorithm is principal component analysis (PCA). In [3], [4], [5], Broomhead and Kirby dev eloped a new frame work for data reduction based on Whitney’ s embedding theorem, a theo- rem from differential topology which gives an upper bound on the dimension of Euclidean space required to smoothly embed a compact k -dimensional manifold [6, Section 1.8]. The goal of this framework is to produce projections which not only retain differential structure b ut also ha ve a well-conditioned in verse (roughly , an in verse for which small changes in the domain produce similarly small changes in the range). This property , not necessarily found for projections obtained from other popular methods such as PCA, means that the projection provides a method to compress and decompress the data without loss of information. In practice, producing projections for a data set X ⊂ R n into R m for m < n within the Whitney reduction framework in volv es ﬁnding projections P : X ⊂ R n → R m for which the smallest value of || P ( x 1 ) − P ( x 2 ) || ` 2 || x 1 − x 2 || ` 2 for x 1 , x 2 ∈ X is maximized. This can be more concisely described as follows: let S be the set of all secants of X (that is, the set of differences of all distinct points in X ). Then we seek a projection P that best preserves S . In Section IV of this paper we describe a new algorithm to accomplish this, which we call Secant-A voidance Projection (SAP). Because the number of secants for a data set of size n is n ( n − 1) / 2 , even relati vely small data sets can hav e large corresponding secant sets. For this reason, the use of GPU architecture is key to the design of this algorithm. Calculation of the secant set itself can be cumbersome on a CPU as the size of n becomes large. Furthermore, many of the steps in the algorithm in v olve small independent calculations with each secant in S . These can be tri vially parallelized and implemented efﬁciently on a GPU. W e note that our algorithm also pro vides a method of approximating the dimension of a data set. W e outline how this is done in Section VI and provide applications in this context. I I I . B AC K G RO U N D T H E O RY In this section we discuss some of the background related to the practical question of dimension estimation [7]. A. T opological Dimension For data sampled from smooth manifolds, e.g, intervals, circles, spheres and tori, the space looks locally ﬂat and it is only at larger scales that the curvature becomes apparent. The dimension of this locally ﬂat space, i.e., the tangent space, is taken to be the dimension of the manifold (which is the same at ev ery point on the manifold). Following [8], we introduce the  -tangent space determined by the singular vectors associated with the singular values of the singular v alue decomposition (SVD) that scale linearly for small  . It is this subset of the basis elements that determine the tangent space and the topological dimension of the manifold [9]. Theorem 1. [10] The basis vectors whose singular spectrum scales as  for small  form a basis for the tangent space center ed at p of an m -dimensional manifold. i.e., those basis vectors for which the set of singular values { σ i } of an  -ball center ed at p have σ i ( p,  ) = k  + O (  2 ) , 1 ≤ i ≤ m for some constant k form a basis for the tangent space at p. It is implicit in the above statement that any basis vector of the local SVD that does not scale linearly li ves in a space per- pendicular to the tangent space. This is a tractable computation to approximate dimension, especially in low dimensions; see, e.g., [11], [12]. Howe ver , in theory it requires the computation of SVDs ov er many scales at each point on the data set. B. Whitne y’ s Theor em A foundational result in differential topology , Whitney’ s embedding theorem gi ves an upper bound on the dimen- sion of Euclidean space needed to smoothly embed a ﬁnite- dimensional compact dif ferentiable manifold. Recall that an immersion is a differentiable function f : M → N between two dif ferentiable manifolds M and N , whose deriv ativ e is ev erywhere injectiv e. Theorem 2. [6] Let M be an m -dimensional differ entiable compact manifold. Then ther e exists an embedding into R 2 m +1 . The proof of this theorem in [13] uses the construction of the secant set S and ar gues that the projection has an in v erse as long as the point of projection does not lie on or inﬁnitesimally close to any secant line of M . C. Bi-Lipschitz Criterion A function f ( x ) is said to be bi-Lipsc hitz on S if for all x, y ∈ S it holds that m 1 k x − y k ` 2 ≤ k f ( x ) − f ( y ) k ` 2 ≤ m 2 k x − y k ` 2 . (1) The constants m 1 , m 2 can be interpreted as follows: • m 1 is the injectivity/immersi vity parameter and prevents pairs of points from collapsing, • m 2 is the Lipschitz constant and prev ents pairs of points from blowing apart. The characterization of a function as bi-Lipschitz is con- nected to dimension estimation [14], [15]. Theorem 3. [16] If a function f : S → T is bi-Lipschitz, then dim( S ) = dim( T ) , wher e the dimension can be taken as the topological dimen- sion, or the Hausdorff dimension. Theorem 4. [4] If a function f ( x ) is bi-Lipschitz, then the in verse function f − 1 is also bi-Lipschitz with injectivity parameter 1 /m 2 and Lipschitz constant 1 /m 1 , i.e., 1 m 2 k x 0 − y 0 k ` 2 ≤ k f − 1 ( x 0 ) − f − 1 ( y 0 ) k ` 2 ≤ 1 m 1 k x 0 − y 0 k ` 2 . T ypically reduction mappings are not optimized for recon- struction and may have ill-conditioned in verses. W e note that when f is a projection, the upper bound in (1) is automatically satisﬁed and only the lower bound needs to be checked. I V . T H E S E C A N T - A V O I DA N C E P RO J E C T I O N A L G O R I T H M In this section we describe our algorithm for data reduction. Giv en a data set of k points X = { x (1) , . . . , x ( k ) } ⊂ R n we want to construct a projection Pr : R n → R m such that the distances between points in X are preserved. T o accomplish this, we propose, for all pairs x ( i ) , x ( j ) ∈ X, to calculate and store the corresponding normalized secant s := x ( i ) − x ( j ) || x ( i ) − x ( j ) || ` 2 . Note that we normalize each of these secants so that when we compare their projections, those corresponding to close points and far points are on an equal footing. The calculation of all secants for a data set is trivially parallelizable and is an ideal task for a GPU. W e store secants as the columns of an n × p matrix S and for iteration i of our algorithm we realize our current projection Pr ( i ) via an n × m matrix P ( i ) whose columns are orthonormal vectors spanning an m -dimensional subspace M ⊂ R n . W e write these column v ectors as p ( i ) 1 , p ( i ) 2 , . . . , p ( i ) m , and P ( i ) = [ p ( i ) 1 | p ( i ) 2 | · · · | p ( i ) m ] . The projection Pr ( i ) ( s ) of secant s can then be computed as P ( i ) T s . Our algorithm is designed to solve the optimization problem P ∗ = arg max P  min s ∈ S || P T s || ` 2  where the maximum is taken ov er all projections P from R n to an m -dimensional subspace. As our initial projection we choose P (1) = [ p (1) 1 | p (1) 2 | · · · | p (1) m ] to be the ﬁrst m columns of U , where S = U Σ V T is the singular value decomposition of S. In other words, Pr (1) is the projection giv en by PCA. After initializing P (1) , the algorithm proceeds as follows: at each iteration i we calculate the ` 2 -norms of all secants under the current projection. In other words we ﬁnd the column of P ( i ) T S with the smallest ` 2 norm; call this column index j ∗ . The column index j ∗ corresponds to the index of the secant s j ∗ as a column in S that is least well preserved by Pr ( i ) . W e construct P ( i +1) from P ( i ) by rotating P ( i ) by a small amount tow ard the direction speciﬁed by the se- cant s j ∗ . In order to shift P ( i ) tow ard this secant we ﬁrst calculate the projection of this secant P ( i ) P ( i ) T s j ∗ in R n . Let p ( i ) t be the vector from p ( i ) 1 , . . . , p ( i ) m which maxi- mizes |h p ( i ) k , s j ∗ i| . If all these quantities are zero then we pick p ( i ) t = p ( i ) 1 . After selecting p ( i ) t , we run the mod- iﬁed Gram-Schmidt algorithm on the m ordered vectors n P ( i ) P ( i ) T s j ∗ , p ( i ) 1 , p ( i ) 2 , . . . , p ( i ) t − 1 , p ( i ) t +1 , . . . , p ( i ) m o to obtain a new set of orthonormal vectors ( P ( i ) P ( i ) T s j ∗ k P ( i ) P ( i ) T s j ∗ k ` 2 , p ( i +1) 2 , p ( i +1) 3 , . . . , p ( i +1) m ) which has the same span but now necessarily contains the normalized projection of s j ∗ . W e then set p ( i +1) 1 to be the unit vector in the direction of (1 − α ) P ( i ) P ( i ) T ( s j ∗ ) + α ( s j ∗ − P ( i ) P ( i ) T ( s j ∗ )) , where α ∈ [0 , 1] is a small constant that controls the amount that our projection shifts at each step. The results presented in this paper were computed using α = 0 . 01 , where this value is chosen to ensure that a minor shift is made in each step and to encourage conv ergence. Experimentally , a value of α = 0 . 01 appears to achiev e those goals. W e summarize the Secant-A voidance Projection algorithm in Algorithm 1. The computational complexity of the SAP algorithm is dominated by the complexity of the SVD and Gram-Schmidt algorithms as well as that of the determination of the secant with the shortest projected norm. A standard implementation approach would result in completion in time O ( n 5 ) . Ho wev er , there are sev eral ways in which one might improv e the asymptotic bound. For example, Monte Carlo approaches to the computation of the SVD and smart updates to the lengths of the projected secants stand to improve the time complexity . Such explorations are a topic for future research. Algorithm 1 Secant-A v oidance Projection 1: inputs Secant set S, desired dimension of embedding m, max number of steps (Iterations) or alternative stopping criterion, and shift parameter α . 2: Initialize P (1) to be the matrix formed by the ﬁrst m columns of U , where S = U Σ V T . 3: f or i ≤ Iterations do 4: Set s j ∗ = arg min s ∈ S || ( P ( i ) ) T s || ` 2 . 5: Set p ( i ) t = arg max 1 ≤ k ≤ m |h p ( i ) k , s j ∗ i| . 6: Apply the modiﬁed Gram-Schmidt algorithm to P ( i ) P ( i ) T s j ∗ , p 1 , . . . , p t − 1 , p t +1 , . . . , p m to obtain a new orthonormal basis P ( i ) P ( i ) T s j ∗ k P ( i ) P ( i ) T s j ∗ k ` 2 , p ( i +1) 2 , . . . , p ( i +1) m . 7: Set p ( i +1) 1 to be the unit vector in the direction of (1 − α ) P ( i ) P ( i ) T ( s j ∗ ) + α ( s j ∗ − P ( i ) P ( i ) T ( s j ∗ )) . 8: i + 1 ← i 9: end for 10: retur n Pr ( i ) V . A S Y N T H E T I C E X A M P L E : T R I G O N O M E T R I C M O M E N T C U RV E S In this section we show an example of an application of the SAP algorithm and we compare the results to other common projection methods. W e construct a synthetic data set by sampling points from the trigonometric moment curve φ : R → R 10 deﬁned by φ ( t ) := (cos( t ) , sin( t ) , cos(2 t ) , . . . , cos(5 t ) , sin(5 t )) . For this example and all following examples, we use a program written in CUD A 8.0 [17]. On two Nvidia T esla K80 graphics processing units, we • construct the secant set for the relev ant data set, • calculate the singular value decomposition for the matrix of all normalized secant vectors using the cuSolver library [18]. • run the algorithm for 100 iterations. For some of the following examples we ﬁx a dimension, and for others, we repeat for various projection dimensions. Figure 1 shows a projection of the sampled trigonometric moment curve data onto the ﬁrst three coordinates in R 10 . Figure 2 sho ws the projection into R 3 obtained from PCA (i.e. the projection of the data from R 10 to R 3 via Pr (1) ), and Figure 3 sho ws the projection onto R 3 obtained from 100 iterations of the Secant-A voidance Projection algorithm. In each case, Fig. 1: The projection of the function φ onto the ﬁrst three coordinate directions in R 10 . Clearly , much information is lost by ignoring the content in the remaining 7 coordinates. Compare to the projections into R 3 provided by PCA, which maximizes v ariance, and to SAP , which seeks a projection with a smooth in verse (Figures 2 and 3, respectiv ely). Fig. 2: The projection of the function φ into R 3 via PCA. Recall that the projection provided by PCA is the projection that captures maximal variance in the data. we drew 12,800 values from a uniform random distribution on [0 , 2 π ] to get a sampling of points on the trigonometric moment curve. Qualitativ ely , as seen in Figures 1, 2, and 3, the three methods of constructing projections ha ve captured different aspects of the data set through their projections into R 3 . From the perspecti ve of the norm of the shortest projected secant, we ﬁnd that SAP outperforms both the na ¨ ıve projection and the PCA projection: the norms of the shortest projected secants, in order , are 0 . 1013 , 0 . 0466 , and 0 . 1677 . It is worth noting that PCA and SAP are methods of projection deﬁned in terms of particular optimization problems. So while PCA, by construction, will maximize variance, SAP aims to maximize the norm of the shortest projected secant. W e see in this Fig. 3: The projection of the function φ into R 3 via SAP . Recall that SAP seeks to produce a projection that maximizes the minimum projected norm across all secants in the data set. example that SAP has indeed outperformed PCA with respect to this metric, as it is designed to do. V I . A P P R OX I M A T I N G T H E D I M E N S I O N O F A DAT A S E T A. A synthetic example Even in the case where one is not interested in ﬁnding a speciﬁc projection, the SAP algorithm can be used to better understand the nature of the data. Speciﬁcally , it can giv e a good sense of the dimensionality of the data. One way to observe this is to run the algorithm for a range of different projection dimensions. One should see the norm of the shortest secant for each projection begin to dramatically increase as the projection dimension increases. V ia Whitney’ s theorem, the dimension at which this occurs leads to a good approximation of the dimension of the data. As a synthetic example we sampled 256 points uniformly from each of the following smooth manifolds: 1) the curve f : R → R 3 f ( t ) = (cos( t ) , sin( t ) , cos(2 t )) smoothly embedded into R 15 , 2) a 2-dimensional torus from R 3 smoothly embedded into R 15 , 3) a 3-dimensional sphere from R 4 smoothly embedded into R 15 . According to Whitney’ s embedding theorem, we expect that the maximum dimension required to embed the curve, torus, and sphere is 3, 5, and 7, respectively . In Figure 4, we see that the SAP algorithm provides an embedding for the three data sets in Euclidean space at these respectiv e dimensions. Speciﬁcally , the norm of the shortest projected secant has increased aw ay from zero for these dimensions. W e note further that the SAP algorithm provides embeddings at dimensions below the Whitney upper bounds in the case of the torus and 3 -sphere. While it is not uncommon for embeddings to exist in dimensions belo w the Whitney upper bound, the Fig. 4: The ` 2 -norm of the shortest secant projection as a function of the dimension of the projection. Note that the dimensions at which the data sets appear to have successful embeddings agree with those provided as upper bounds by Whitney’ s embedding theorem (3, 5, and 7, respectiv ely). Also note that the results reﬂect the relativ e dimensionality of the data sets. appearance of such embeddings here further highlights the potential usefulness of the SAP algorithm for dimensionality reduction. W e observe as well that there is a qualitativ e distinction between the behavior of the three data sets shown in Figure 4. The norm of the shortest projected secant vector for the three data sets preserv es the relativ e dimensionality relationships between the data sets. B. The pumpkin illumination space W e demonstrate the use of our algorithm on an illumination space data set. W e collected the data for use in the Pattern Analysis Lab at Colorado State Uni versity . The data consists of images of a solid object under varying illumination conditions. Speciﬁcally , we capture images of a plastic Hallo ween jack-o- lantern under varying illumination conditions. The images are color images of size 480 × 720 . W e pre-process by using PCA to reduce dimension and remove noise. Consequently , we have a set of 200 data points in R 200 , each of which corresponds to one of the original images. This data set illustrates the potential value of the Secant- A voidance Projection algorithm. The data set represents a sample of a subspace of the illumination space of the pumpkin, analogous to many existing data sets that capture individuals and objects under varying illumination. Illumination spaces of people have been well-studied and there have been attempts to understand the dimension of such spaces; see, e.g., [19], [20], [21], [22]. W e conjecture that the illumination space of the pumpkin will ha ve similar characteristics, and the SAP algorithm provides a means of understanding the dimension- ality of the sampled subspace of the illumination space (we expect to see a subspace because the variations in illumination Fig. 5: The projection of the pumpkin data set using the SAP algorithm. T wo distinguished points are marked with a red circle and a red diamond - see Figure 7 for the corresponding images from the original pumpkin illumination space data set. Fig. 6: The projection of the pumpkin data set using the SAP algorithm with points connected by their order in the time series in which the data was collected. were restricted to a subset of those that would provide a broad representation of the set of all possible illuminations). The pumpkin illumination space data is a real-world data set with noise and unknown structure, and we do not kno w a priori the natural dimension of the data. But dimensionality , in general, is of primary importance for making data storage and analysis a manageable task. By applying the SAP algorithm, we obtain an estimate of the dimension of this data set. In this case, we note that the SAP algorithm provides a diffeomorphic copy of the data in R 3 . Using Whitney’ s Theorem, we spec- ulate that the subspace of the illumination space determined by these constrained lighting conditions has dimension one. W e can further understand characteristics of the data set that become apparent after applying the SAP projection. For example, note that two points in the projection shown in Figure 5 appear to be extrema; by viewing the images from Fig. 7: Images Corresponding to Extremal Points in Projection to R 3 via SAP . The left image corresponds to the red circle and the right image corresponds to the red diamond in Figure 5. Fig. 8: The con ver gence of the SAP algorithm on the pumpkin data set. The norm of the shortest projected secant vector generally increases with iteration number . the original data set that correspond to those extremal points, we note that indeed, we have found points that correspond to extreme lighting conditions (Figure 7). Qualitatively , the SAP projection appears to have preserved fundamental features in the data set. For e xample, the data set can be vie wed as a time series as a consequence of the way the data was captured - each successiv e data point corresponds to a small change in lighting conditions. Thus, we would hope that a good projection would preserve this feature of the data, and indeed, if one connects the points in the time series as in Figure 6, we see that the smooth light v ariations that v aried repeatedly from one extreme to another resulted in an embedding that parametrizes a non-self-intersecting path in three-space. In Figure 8, we see the con ver gence of the algorithm ov er 100 iterations, where the dimension has been ﬁxed to be three. Note that while there are small perturbations, the norm of the shortest projected secant vector increases with the iteration number and thus we expect that the projections are getting correspondingly better . W e compare the performance of the SAP algorithm against that of PCA in Figure 9. Note that both would provide Fig. 9: Comparison of the norm of the shortest projected secant vector as a function of dimension after computing a projection using PCA and a projection using SAP on the pumpkin data set. Note that the SAP algorithm outperforms PCA at lo wer dimensions and persists through high dimen- sions according to this measure of success. relativ ely good projections into R 20 , but the SAP algorithm outperforms PCA signiﬁcantly at lower dimensions and per- sists in providing a better projection for higher dimensions as well (with respect to this measure of success). Further , the manner of construction of the data set suggests that the data should liv e on a one-dimensional manifold and hence, the embedding into R 3 by the SAP algorithm is consistent with the upper bound from Whitney’ s Theorem. C. Indian Pines hyperspectral data cube As our ﬁnal example we consider the application of our al- gorithm to a modiﬁed version of the Indian Pines hyperspectral data set (some bands cov ering the region of water absorption are removed) [23]. The data cube is 145 × 145 × 200; that is, there are 200 bands, each with spatial resolution of 145 × 145 . W e deﬁne a data set X ⊂ R 200 to be the collection of vectors of spectral information taken across all pix el locations. W e then hav e | X | = 21 , 025 . The number of secants associated to this number of points is too massiv e to handle with the current version of the SAP algorithm so we use a sampling technique and show that in this example at least, such a strategy yields similar results across different random samples. In Figure 10 we show the plot of the dimension of the projection versus the norm of the shortest projected secant for ten experiments in which we randomly sample 512 points from X and run the SAP algorithm on the associated secant set. As can be seen, aside from minor deviations, these curves are all very similar suggesting that sampling points might be a reasonable strategy to adopt in the setting when the total number of secants in the data set is very large. Note also that in all cases the curve jumps from less than 0 . 05 to 0 . 2 as the projection dimension increases from 3 to 4 . Thus, we have an embedding of the Indian Pines data into Euclidean space of Fig. 10: The ` 2 -norm of the shortest secant projection as a function of the dimension of the projection for the Indian Pines data set. Each curve in the graph corresponds to SAP- algorithm projections computed for a subsample of 512 points from the Indian Pines data set. Note the sudden jump in ` 2 - norm when the dimension of the projection increases to 4 . dimension 4. If the data resides on a manifold, that manifold most likely has dimension two. V I I . D A T A W I T H N O I S E In the real world, data is almost always contaminated with some amount of noise. If a data set in R n sits on an m - dimensional manifold M with m < n , then adding noise will tend to push data points off of M , thereby increasing its apparent dimension. Since we are usually more interested in calculating the dimension of the pure signal without noise, it is useful to de velop tests of dimensionality which are resistant to added noise. In this section we suggest an adaptation to the SAP algorithm for use on noisy data. The idea behind this adaptation stems from the simple observation that when noise is added to a data set it perturbs the positions of points. W e therefore expect that the direction of secants between points which are close together will change much more than the direction of secants between points which are far apart. In f act, we can imagine that if tw o points are sufﬁciently close, then when noise is added, their secant could be rotated to any direction. W e propose to adapt the SAP algorithm for the setting of noisy data by including an additional step in which secants are thresholded by length. That is, prior to normalization, we discard any secants with length less than some predetermined threshold ` . This thresholding v alue ` is then an additional parameter to the algorithm. There is a clear trade-off in potential choices of the size of ` : a small ` may lead to noise- based structure persisting in the data, and a large ` may cause small-scale structure in the data to be lost. In Figure 11 we sho w the result of adding random Gaus- sian noise to points drawn from the trigonometric moment curve φ : R → R 10 from Section V; the noise is added independently in each coordinate and has mean 0 and standard Fig. 11: Projection of Trigonometric Moment Curve with noise onto ﬁrst three coordinate directions. The noise is independent for each coordinate and has mean 0 and standard deviation 0.1. deviation 0 . 1 . Figure 12 shows the result of running the SAP algorithm on the data without any noise added, with the noise added but without thresholding, and with the noise added and thresholding secants at length 2. Note that the thresholding procedure has improved the norm of the shortest projected secant of the noisy data to be more similar to the setting in which there is no noise. This is notably the case for projections of dimensions 3 - 6 , which is where we would look to better understand the dimension of this data set. In fact, comparing Figure 12 to Figure 4, we note that the plot of φ with noise and no thresholding looks similar to the plot of a genuinely higher dimensional data set. Thresholding secants remov es this ambiguity . V I I I . C O N C L U S I O N In this paper we described a no vel algorithm for dimension- ality reduction. The Secant-A v oidance Projection algorithm produces projections of the data into a lower -dimensional space. Importantly , these projections preserv e dimension and hav e smooth in verse. Since our algorithm requires many independent calculations with the elements of the secant set, it is well-suited to a GPU implementation. W e also show that besides ﬁnding good projections of a data set, our algorithm can be used to identify the approximate dimension of a data set. There remain sev eral related open questions and directions for future work; we highlight a selection here. 1) W e have proposed a variation of the SAP algorithm for noisy data. This variation incorporates a threshold `, where we discard an y secants shorter than ` . A next step is the de velopment of a rigorous method of determining ` for an arbitrary data set. 2) GPUs provide an ideal setting for the SAP algorithm because of the natural parallelization of computation of the secant set. Further work should be done to dev elop Fig. 12: W e compare the norm of the shortest projected secant for: 1) SAP applied to points sampled from the curve deﬁned by φ from Section V, 2) SAP applied to φ with noise added, and 3) SAP with thresholding applied to φ with noise added. Note that the thresholding procedure has improved the embedding and made the results more similar to the no-noise setting. an efﬁcient algorithm that utilizes GPUs to their full extent throughout the implementation. 3) W e recognize that the advantages gained by parallel computation of the secant set on the GPU have limi- tations. One av enue for further research lies in potential means of combining information from repeated sampling from a data set with independent applications of the SAP algorithm. One algorithm we suggest is to cluster resulting projections and to compute an appropriate av erage for each cluster , resulting in several potentially useful methods of projection. Such an algorithm would be particularly useful in the setting in which the mem- ory required for computation of the full secant set is infeasible. 4) The proposed algorithm provides no guarantees of con- ver gence to a global optimum. Extensions of this work could propose sufﬁcient conditions for global e xtrema or an algorithm that does hav e such a guarantee. W ith high-dimensional data being generated at an unprece- dented rate, we can expect that demand will only increase for methods of (i) deducing the actual dimension of data and (ii) reducing the dimension of data. Ultimately , we hope that this paper will serve as a ﬁrst step to ward a broader con v ersation on how to best harness the power of GPU computing to develop secant-based methods that offer efﬁcient solutions to these problems. A C K N O W L E D G E M E N T S This paper is based on research partially supported by the National Science F oundation under Grants No. DMS-1513633, and DMS-1322508 as well as DARP A awards N66001-17-2- 4020 and D17AP00004. R E F E R E N C E S [1] Joshua Thompson, David W . Dreisigmeyer , T erry Jones, Michael Kirby , and Joshua Ladd, “ Accurate fault prediction of BlueGene/P RAS logs via geometric reduction, ” in Pr oceedings 1st W orkshop on F ault-T olerance for HPC at Extr eme Scale (FTXS 2010) , Chicago, Illinois, 2010. [2] Kun W ang, Josh Thompson, Chris Peterson, and Michael Kirby , “Iden- tity maps and their extensions on parameter spaces: Applications to anomaly detection in video, ” in Science and Information Conference (SAI), 2015 . IEEE, 2015, pp. 345–351. [3] D.S. Broomhead and M. Kirby , “Large dimensionality reduction using secant-based projection methods: The induced dynamics in projected systems, ” Nonlinear Dynamics (Special Issue on Reduced Order Modelling) , v ol. 41, no. 1-3, pp. 47–67, 2005. [4] D.S. Broomhead and M. Kirby , “ A new approach for dimensionality reduction: Theory and algorithms, ” SIAM J. of Applied Mathematics , vol. 60, no. 6, pp. 2114–2142, 2000. [5] D.S. Broomhead and M. Kirby , “The Whitney reduction network: a method for computing autoassociativ e graphs, ” Neural Computation , vol. 13, pp. 2595–2616, 2001. [6] V ictor Guillemin and Alan Pollack, Differ ential topology , AMS Chelsea Publishing, Pro vidence, RI, 2010, Reprint of the 1974 original. [7] W . Hurewicz and H. W allman, Dimension theory , Princeton mathemat- ical series. Princeton Univ ersity Press, 1948. [8] D. S. Broomhead, R. Jones, and G. P . King, “T opological dimension and local coordinates from time series data, ” J. Phys. A: Math. Gen , vol. 20, pp. L563–L569, 1987. [9] James R. Munkres, T opolo gy: a ﬁrst course , Prentice Hall, Engle wood Cliffs, N.J., 1975. [10] DS Broomhead, R Indik, AC Newell, and DA Rand, “Local adaptiv e Galerkin bases for large-dimensional dynamical systems, ” Nonlinearity , vol. 4, no. 2, pp. 159, 1991. [11] R.V . Abadi, D. S. Broomhead, R.A. Clement, J.P . Whittle, and R. W or- folk, “Dynamical systems analysis: A new method of analysing con- genital nystagmus wav eforms, ” T echnical Report 97-3, UMIST Applied Mathematics, 1997. [12] D. Hundley and M. Kirby , “Estimation of topological dimension, ” in Proceedings of the Third SIAM International Conference on Data Mining , San Fransico, 2001, pp. 194–202. [13] Morris W . Hirsch, Differ ential topology , vol. 33 of Graduate T e xts in Mathematics , Springer-V erlag, New Y ork, 1994, Corrected reprint of the 1976 original. [14] M. Anderle, D. Hundley , and M. Kirby , “The bilipschitz criterion for mapping design in data analysis, ” Intelligent Data Analysis , vol. 6, no. 1, pp. 85–104, 2002. [15] M. Kirby , Geometric Data Analysis: An Empirical Appr oach to Dimensionality Reduction and the Study of P atterns , W iley , 2001. [16] K enneth Falconer , F ractal g eometry , John W iley & Sons, Inc., Hoboken, NJ, second edition, 2003, Mathematical foundations and applications. [17] John Nickolls, Ian Buck, Michael Garland, and K evin Skadron, “Scal- able parallel programming with CUD A, ” Queue , vol. 6, no. 2, pp. 40–53, Mar . 2008. [18] NVIDIA, “Cusolver library , ” 2018, [Online; accessed 25-February- 2018]. [19] Peter N Belhumeur and David J Kriegman, “What is the set of images of an object under all possible illumination conditions?, ” International Journal of Computer V ision , vol. 28, no. 3, pp. 245–260, 1998. [20] Athinodoros S. Georghiades, Peter N. Belhumeur , and Da vid J. Krieg- man, “From few to many: Illumination cone models for face recognition under variable lighting and pose, ” IEEE transactions on pattern analysis and mac hine intelligence , vol. 23, no. 6, pp. 643–660, 2001. [21] Jen-Mei Chang, Michael Kirby , Holger Kley , Chris Peterson, Bruce Draper , and J Ross Beveridge, “Recognition of digital images of the human face at ultra low resolution via illumination spaces, ” in Asian Confer ence on Computer V ision . Springer, 2007, pp. 733–743. [22] Jen-Mei Chang, J Ross Beveridge, Bruce A Draper , Michael Kirby , Holger Kley , and Chris Peterson, “Illumination face spaces are idiosyn- cratic., ” IPCV , v ol. 2, pp. 390–396, 2006. [23] Grupo de Inteligencia Computacional, “Hyperspectral remote sensing scences, ” 2014, http://www .ehu.eus/ccwintco/index.php/Hyperspectral Remote Sensing Scenes, Last accessed on 2018-4-30.

A GPU-Oriented Algorithm Design for Secant-Based Dimensionality Reduction

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment