Multi-Scale Local Shape Analysis and Feature Selection in Machine Learning Applications

We introduce a method called multi-scale local shape analysis, or MLSA, for extracting features that describe the local structure of points within a dataset. The method uses both geometric and topological features at multiple levels of granularity to capture diverse types of local information for subsequent machine learning algorithms operating on the dataset.

Authors: Paul Bendich, Ellen Gasparovic, John Harer, Rauf Izmailov, and Linda Ness

Paul Bendich, Ellen Gasparovic, John Harer, Rauf Izmailov, and Linda Ness

Abstract. We introduce a method called multi-scale local shape analysis for extracting features that describe the local structure of points within a dataset. The method uses both geometric and topological features at multiple levels of granularity to capture diverse types of local information for subsequent machine learning algorithms operating on the dataset. Using synthetic and real dataset examples, we demonstrate significant performance improvement of classification algorithms constructed for these datasets with correspondingly augmented features.

1. Introduction

The goal of this paper is to introduce a preliminary version of what we call multi-scale local shape analysis (MLSA), a method for extracting features of a dataset that describe the local structure, both manifold and singular, of points within the dataset. MLSA is a mixture of multi-scale local principal component analysis (MLPCA) and persistent local homology (PLH). In this paper, we describe both of these techniques and our merger of them, and we demonstrate the potential of MLSA on two synthetic datasets and one real one. The potential of these methods and their merger is investigated in the context of one of the typical applications of data analytics: the classification problem for multi-dimensional datasets. The relevance of the developed techniques is thus assessed by the quality of the resulting classification decision rule, measured by the expected test misclassification error together with its sensitivity and specificity (false positive and false negative error rates).
The quality of the solution of the classification problem depends significantly on the choice of features: specifically, on (1) extraction of new features that can contain additional relevant information for the given problem, (2) pre-processing the features in a way that makes them feasible for scalable and robust computations, and (3) removing features that have little relevance for the problem. While there are well-known mechanisms for removing features (i.e., feature selection; see [23], [9]), the problem of constructing or adding features [16] is much more challenging (see [18], [30]), since it often relies on domain expertise, which is difficult to automate. That is why, besides domain expertise, numerous geometric approaches for feature extraction have been employed to reduce the misclassification error rate of the decision rule (e.g., kernel PCA [26], mutual information [30], manifold learning [19], [14], image-derived features [8], [27]). The added benefit of these methods stems from the fact that geometric methods can expose additional relevant information about shapes that is hidden in the original data. The methods outlined in this paper capture both geometric (MLPCA) and topological (PLH) structures of datasets, thus exposing the corresponding structures to machine learning tools and boosting their performance.

MLPCA. Principal Component Analysis (PCA) is a standard technique that takes a point cloud X ⊆ R^D as input and returns information about the directions of data variation and their relative importance. A standard output of PCA is an integer k and a projection of the data onto the k-dimensional linear subspace of R^D spanned by the k most important directions. In this case, it is reasonable to say that the "intrinsic dimension" of X is k, in the sense that if one "zooms in" at a point of X, it will look like a k-flat.
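The "intrinsic dimension" estimate just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; in particular the 95% explained-variance threshold is our own illustrative choice.

```python
import numpy as np

def pca_intrinsic_dim(X, var_threshold=0.95):
    """Estimate the 'intrinsic dimension' of a point cloud X (n x D):
    the smallest k whose top-k principal directions explain at least
    var_threshold of the total variance."""
    Xc = X - X.mean(axis=0)                            # center at the mean
    cov = Xc.T @ Xc / len(X)                           # empirical covariance
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending eigenvalues
    explained = np.cumsum(eigvals) / eigvals.sum()     # cumulative variance share
    return int(np.searchsorted(explained, var_threshold) + 1)

# Points spanning a 2-flat in R^3: the estimated intrinsic dimension is 2.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))           # orthonormal 2-frame in R^3
plane = rng.normal(size=(500, 2)) @ Q.T
print(pca_intrinsic_dim(plane))  # -> 2
```

Note that the answer depends on the chosen variance threshold, which already hints at the scale issues discussed next.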
Of course, for some datasets the intrinsic dimension varies as we move around the data (think of a plane pierced by a line in R^3). For almost any dataset, the notion of "intrinsic dimension" depends on how much one wants to zoom in. For example, data sampled from a thin strip in R^2 will look either one- or two-dimensional, depending on the notion of local scale.

PLH. To complicate matters further, there are datasets X and points z for which the statement "locally at z, X looks like a k-flat" is simply not true for any value of k. For example, consider a dense sample from the intersection of two planes in R^3. If z is any point very near the line of intersection, then MLPCA at almost any radius will return a 3-flat, which is not a reasonable answer. The key issue here is that proximity to singularities complicates the notion of local dimension. It is here that PLH proves useful.

The concept of PLH is built on a traditional algebro-topological notion called local homology groups; see, for example, Chapter 35 in [24]. These groups are meant to assess the "local" structure of a point z within a topological space X. Among their nice properties is the fact that they are the same for every point z precisely when X is a topological manifold. When X is not, these groups differ as we move z around, and in fact provide a great deal of information about the local singularity structure at each non-manifold point. On the other hand, the concept of "local" is a tenuous one in the noisy point-cloud context, where what is meant by local depends entirely on an often impossible-to-choose scale parameter. This issue was addressed in [5], where a tweakable radius parameter R was added to the definition. The version of local homology that we will use in this paper differs slightly from that in [5], but we feel it is simpler both for exposition and for computational purposes.

Related work.
To the best of our knowledge, this is the first attempt to use PLH in the construction of features for classification problems. PLH has been used before [15] in the context of dimension reduction. However, the goal there is to use PLH to detect the dimension of the manifold which, under the assumptions of that paper, underlies the given dataset. Our goal is quite different: to augment a more standard dimension-detection method with PLH in order to understand features that may arise by dropping the underlying-manifold assumption. Another paper [6] uses PLH, along with a more complicated mapping construction to transfer PLH information from one point to another, in an effort to learn the underlying stratified structure of a space from a point sample of it. There is also some work [28] which learns the stratified structure of a union of flats via Grassmannian methods. Finally, a recent paper [2] uses PLH in the construction of a novel distance between different road map reconstructions.

Outline. The structure of this paper is as follows. In Section 2, we briefly review the ideas behind MLPCA, followed by a more in-depth description of PLH in Section 3. Then, in Section 4, we combine the two techniques into MLSA and demonstrate its utility in machine learning experiments involving three sample datasets, two synthetic and one real. The results of these experiments are summarized in the tables of Section 6. We conclude with some discussion in Section 5.

Figure 1: Multi-scale local principal component analysis (MLPCA).

2. Multi-Scale Local Principal Component Analysis

In multi-scale local principal component analysis (MLPCA) [22], one takes a point cloud X, a particular point z, and a radius R, and one computes PCA on the sub-cloud of points within the Euclidean R-ball around z.
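The single-point, single-radius step can be sketched as follows; `local_pca` is our own minimal helper, not code from the paper, and it assumes the R-ball contains at least one point.

```python
import numpy as np

def local_pca(X, z, R):
    """PCA on the sub-cloud of X (n x D) inside the Euclidean R-ball
    around z: returns the eigenvalues (descending) and matching
    eigenvectors of the local covariance matrix."""
    nbrs = X[np.linalg.norm(X - z, axis=1) <= R]   # points in the R-ball
    centered = nbrs - nbrs.mean(axis=0)
    cov = centered.T @ centered / len(nbrs)
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Near the origin, a sample of the line y = x is 1-dimensional:
t = np.linspace(-1, 1, 401)[:, None]
X = np.hstack([t, t])
eigvals, _ = local_pca(X, np.zeros(2), 0.2)
print(eigvals)  # one positive eigenvalue, the other numerically zero
```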
This process is then repeated for multiple radii and at many points to get a general profile of how the dataset looks at different locations and at different scales (Figure 1). These multi-scale features may then be used in the overall data analysis along with the original features (the coordinates of the points).

This approach belongs to a growing family of multi-scale methods exploring and leveraging natural differences in the availability and relevance of information at different scales of the dataset in question. Multi-scale methods tend to reveal more accurate information about the structure of a dataset than global (single-scale) techniques.

The eigenvalues and eigenvectors of the covariance matrix provide a core set of features which capture geometric information about the dataset. The first k eigenvectors of the covariance matrix define the k-dimensional hyperplane (through the center of mass) minimizing Σ_i distance(x_i, P)^2, where the minimum is taken over all k-planes P. This geometric information can be significantly enriched by computing the eigenvalues and eigenvectors for a set of multi-scale neighborhoods of points. Bassu et al. [4] exploited multi-scale local PCA features to define a support vector machine (SVM) decision rule that distinguished pointwise two unknown empirical measures (ground and vegetation) on the same domain of LIDAR-generated surface images. The same authors later applied this technique to the classification of various types of satellite images of vessels [3]. These two image analysis experiments demonstrate that centralized multi-scale PCA features on datasets in R^n provide necessary and sufficient conditions for datasets to have certain properties.

3. Persistent Local Homology

This section contains a formal description of PLH features, which will be used in the next section to augment MLPCA features in several example applications.
We assume that the reader understands homology groups, and we give only the briefest of reviews of persistent homology. For a good reference on the former, see [24]; for the latter, see [17] or [11]. All homology groups need to be computed over a field for the definition of persistence to make sense, and that field is usually Z/2Z for computational reasons.

3.1. Persistent Homology. Suppose that X is a topological space equipped with a real-valued function f. For each real number α, we define the threshold set X_α = { x ∈ X | f(x) ≤ α }. Note that increasing α from negative to positive infinity provides a filtration of X by these threshold sets. For each non-negative integer k, the persistence diagram Dgm_k(f) summarizes the appearance and disappearance of k-th homology during this filtration.

For example, take X to be a closed interval and f to be the function whose graph is drawn in black on the left of Figure 2. Then the persistence diagram Dgm_0(f), shown as black squares on the right of the same figure, encodes the appearance and subsequent merging of components under this filtration of the interval.

Figure 2: Left: the graphs of functions f (black) and g (red). Right: the persistence diagrams Dgm_0(f) (black) and Dgm_0(g) (red).

Stability theorem. It turns out that the persistence diagram Dgm_k(f) is robust to small changes in the input function. To make this more precise, we properly define a persistence diagram to be a multi-set of dots in the extended plane, with the extra condition that there is a dot of infinite multiplicity at each point along the major diagonal y = x. The persistence of a dot u = (x, y) in a diagram is defined to be y − x, its vertical distance to the major diagonal.
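For a function sampled at the vertices of a path graph, Dgm_0 can be computed by a union-find sweep over function values, with the elder rule deciding which component dies at each merge. The sketch below is our own minimal implementation (not the paper's code), intended only to make the Figure 2 example concrete.

```python
def persistence_0d(f):
    """0-dimensional sublevel-set persistence of a function sampled on a
    path graph (vertices 0..n-1, edges between neighbors). Returns the
    (birth, death) pairs of Dgm_0; zero-persistence pairs are dropped,
    and the component born at the global minimum never dies."""
    n = len(f)
    parent = [None] * n                      # union-find; None = not yet born
    birth = {}
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i in sorted(range(n), key=lambda v: f[v]):   # sweep by function value
        parent[i] = i
        birth[i] = f[i]
        for j in (i - 1, i + 1):                     # path-graph neighbors
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # elder rule: the component with the higher birth dies now
                young, old = (ri, rj) if birth[ri] >= birth[rj] else (rj, ri)
                if birth[young] < f[i]:              # skip zero-persistence pairs
                    pairs.append((birth[young], f[i]))
                parent[young] = old
    pairs.append((min(f), float('inf')))             # the essential component
    return sorted(pairs)

# Two basins at heights 0 and 1 separated by a peak at height 3:
print(persistence_0d([0, 3, 1]))  # -> [(0, inf), (1, 3)]
```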
Given p ∈ [1, ∞), the p-th Wasserstein distance between two diagrams D and D′ is defined to be

(1)   W_p(D, D′) = ( inf_{φ : D → D′} Σ_{u ∈ D} ‖u − φ(u)‖^p )^{1/p},

where the infimum is taken over all bijections φ between the diagrams; note that such bijections always exist, due to the infinite-multiplicity dots along the major diagonal. Letting p tend to infinity results in the bottleneck distance W_∞ between the diagrams.

These distances are computed [17] by constructing a minimal-weight perfect matching on a weighted bipartite graph, which makes them expensive to evaluate; the distances themselves are therefore not practical as an efficient large-scale tool. However, they are important for stating the stability properties of persistence diagrams, as we now illustrate via an example. Let g be the function whose graph appears in red on the left of Figure 2, with diagram Dgm_0(g) given by red circles on the right. Then the optimal bijection between the two diagrams matches the three red dots near the diagonal to the diagonal itself, and the other red dots to their closest black squares, with the bottleneck distance being the longest distance any red dot has to move during this process. Note that the two diagrams are quite close under this metric, as are the two functions under the L∞-metric. This is true in general:

Theorem 3.1 (Diagram Stability Theorem). Let f and g be two tame functions on a compact space X. Then, for each non-negative integer k, we have

W_∞(Dgm_k(f), Dgm_k(g)) ≤ ‖f − g‖_∞.

See [12] and [11] for a more technical discussion of this theorem, and see [13] for similar theorems, with more assumptions required, about W_p-stability.

3.2. Local Homology. We now describe local homology groups, before moving on to their persistent version.
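As an aside before treating local homology: for toy diagrams, the bottleneck distance W_∞ defined above can be evaluated by brute force, using the standard trick of augmenting each diagram with the diagonal projections of the other's dots so that bijections always exist. This sketch is ours, is exponential in the number of dots, and is purely illustrative; real implementations use the bipartite-matching algorithms of [17].

```python
import itertools

def bottleneck(D1, D2):
    """Brute-force bottleneck distance between two small persistence
    diagrams (lists of (birth, death) dots). Each diagram is augmented
    with the diagonal projections of the other's dots; matching two
    diagonal points costs nothing."""
    proj = lambda u: ((u[0] + u[1]) / 2,) * 2          # nearest diagonal point
    A = [(u, False) for u in D1] + [(proj(v), True) for v in D2]
    B = [(v, False) for v in D2] + [(proj(u), True) for u in D1]

    def cost(a, b):
        (u, u_diag), (v, v_diag) = a, b
        if u_diag and v_diag:
            return 0.0                                  # diagonal-to-diagonal is free
        return max(abs(u[0] - v[0]), abs(u[1] - v[1]))  # L-infinity on dots

    return min(max(cost(a, b) for a, b in zip(A, perm))
               for perm in itertools.permutations(B))

# One dot moved slightly (cost 0.1); one low-persistence dot is cheaper
# to retire to the diagonal (cost 0.2) than to match across:
print(round(bottleneck([(0.0, 4.0)], [(0.1, 4.0), (1.0, 1.4)]), 6))  # -> 0.2
```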
Let Y be a topological space embedded in some R^D, fix a point z ∈ R^D, a positive real number R, and a non-negative integer k, and let S_R(z) denote the sphere of radius R about z; that is, S_R(z) = { y ∈ R^D | ‖y − z‖ = R }. The k-th local homology group of Y, with center point z and radius R, is defined to be

(2)   LH_k(Y, z, R) = H_k(Y ∩ S_R(z)).

For a simple example, take Y to be an infinite plane in R^3, z to be a point on that plane, and R to be any positive number. Then Y ∩ S_R(z) is a circle, so LH_k(Y, z, R) has rank one when k = 0 or k = 1 and is zero otherwise. For a more complicated example, let z be the red point in Figure 3, and let r and R be the radii of the smaller and larger circles, respectively. Then LH_0(Y, z, r) and LH_0(Y, z, R) have ranks two and four, respectively; both groups are zero for all other k.

Figure 3: Within the smaller sphere, the red point looks like part of a 1-manifold; within the larger sphere, its local structure is more complicated.

Instabilities. As defined above, the local homology group LH_k(Y, z, R) depends on three inputs, and it turns out that it can change in an unstable fashion with each. The example in Figure 3 shows that the local homology group can depend strongly on the choice of radius. From the same figure, we can also see that it depends on the choice of center point; for example, if we fix a small value of r and move the center point z gradually along the line from its current location to the crossing point, the rank of LH_0(Y, z, r) jumps suddenly from two to four. The group LH_k(Y, z, R) also clearly depends on the space Y. For a stark example, replace Y in Figure 3 with a dense point sample U. Then, with probability one, the intersection U ∩ S_R(z) will be empty, and so LH_0(U, z, R) will be zero.
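The last instability is easy to see numerically: a finite sample essentially never meets a sphere exactly, so the non-persistent local homology of a point cloud is degenerate. A toy check (a hypothetical sample of two crossing lines, with our own helper name):

```python
import numpy as np

def exact_sphere_hits(U, z, R):
    """Number of sample points lying exactly on the sphere S_R(z)."""
    return int(np.sum(np.linalg.norm(U - z, axis=1) == R))

# Dense sample of the crossing lines y = x and y = -x:
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=(2000, 1))
U = np.vstack([np.hstack([t, t]), np.hstack([t, -t])])
print(exact_sphere_hits(U, np.array([0.3, 0.3]), 0.25))  # prints 0: U misses the sphere
```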
Fortunately, the persistent version of local homology, which we now describe, varies continuously with these input parameters, and so is much more suitable for real-world data analysis.

3.3. Persistent Local Homology. In what follows, we let Y be a compact space in R^D. For a working example, imagine that Y is the pair of crossing line segments on the left side of Figure 4, but it might also be some point cloud U sampled from it.

Figure 4: Left: Y is the intersecting line segments, z is the marked point, and S_R(z) is the dashed circle. Right: the persistence diagram PLH_0(Y, z, R); the dot on top has infinite death time, and the values of α_1 through α_4 are indicated along the birth-axis.

For the moment, we fix a value of R and a choice of z. The basic idea behind PLH is that, instead of examining the connectivity of Y ∩ S_R(z), we use a gradually thickening Y to filter S_R(z) and watch the homological information change during this process. More precisely, we let d_Y : R^D → R be the distance function which maps each point x in R^D to the distance d_Y(x) from its closest neighbor on Y. Abusing the notation above, we let Y_α denote the threshold sets for this function. Note that Y_0 = Y, while Y_α for positive α represents a thickened version of Y; when Y is a point cloud, Y_α is just the union of closed balls of radius α around all of the points in the cloud.

For each fixed value of α, we form the intersection Y_α ∩ S_R(z). We then let α increase from zero to infinity and track the evolution of the homology groups H_k(Y_α ∩ S_R(z)), calling the resulting persistence diagram PLH_k(Y, z, R). Note that this is just an alternative way of looking at the persistence diagram Dgm_k(d_Y |_{S_R(z)}).

Example.
With Y, z, and R as indicated in the caption for the left side of Figure 4, we work through the computation of PLH_0(Y, z, R). The original space Y ∩ S_R(z) consists of three points, which we will call A, B, C, working from left to right. A small amount of thickening, say to α_1, produces a fourth component D. Almost immediately after, at α_2, the components A and B merge, followed quickly, at α_3, by the merging of D and C. At that time, there are two growing components, and they eventually merge when the entire sphere is filled in at α_4. In summary, we have the birth-death pairs { (0, ∞), (0, α_4), (0, α_2), (α_1, α_3) }, which leads to the diagram shown on the right of the same figure.

Stabilities. As promised above, we now show that the persistence diagrams PLH_k(Y, z, R) are robust to small changes in any of their three inputs. Recall that the Hausdorff distance d_H(Y, Y′) between two compact subsets is defined to be the minimum ε such that Y ⊆ Y′_ε and Y′ ⊆ Y_ε.

Theorem 3.2. Let Y, Y′ be compact subsets of R^D, let z, z′ ∈ R^D, and let R, R′ > 0. Set ε = d_H(Y, Y′) and fix a non-negative integer k. Then:

W_∞(PLH_k(Y, z, R), PLH_k(Y, z, R′)) ≤ |R − R′|,
W_∞(PLH_k(Y, z, R), PLH_k(Y, z′, R)) ≤ ‖z − z′‖,
W_∞(PLH_k(Y, z, R), PLH_k(Y′, z, R)) ≤ ε.

Proof. For the first inequality, we may assume that our fixed center point z is at the origin in R^D. We define two functions f, g : R^D → R by f(x) = d_Y(Rx) and g(x) = d_Y(R′x). Note that restricting f and g to the unit sphere S in R^D is the same thing as restricting d_Y to S_R(z) and S_{R′}(z), respectively. So by Theorem 3.1, we have

W_∞(PLH_k(Y, z, R), PLH_k(Y, z, R′)) ≤ ‖f|_S − g|_S‖_∞.
To bound the right-hand side, fix some x ∈ S, and let y_0 and y_1 be the closest points in Y to Rx and R′x, respectively. Then

f(x) = ‖y_0 − Rx‖ ≤ ‖y_1 − Rx‖,   g(x) = ‖y_1 − R′x‖ ≤ ‖y_0 − R′x‖.

Assume for the moment that f(x) ≥ g(x). Then

|f(x) − g(x)| ≤ | ‖y_1 − Rx‖ − ‖y_1 − R′x‖ | ≤ ‖x‖ |R − R′| = |R − R′|.

An identical argument takes care of the case when g(x) > f(x), and taking a maximum over S gives the claim. A similar argument with the functions h(x) = d_Y(z + Rx) and j(x) = d_Y(z′ + Rx) suffices for the second inequality. Finally, it is easy to see that ‖d_Y − d_{Y′}‖_∞ ≤ d_H(Y, Y′) for any pair of compact spaces. Since restricting the domain of these two functions to S_R(z) can only make the L∞-distance smaller, a final application of Theorem 3.1 gives the third inequality. □

These results can of course be combined, along with the triangle inequality, to give a bound for what happens when all three inputs are changed at once.

4. MLSA Features in Machine Learning

In what follows, we present two examples of synthetically generated point cloud datasets that were sampled from simple stratified spaces. We use them to investigate the role of MLPCA features and PLH features, separately and taken together, in SVM learning applications. We refer to the combination of MLPCA and PLH features as MLSA features. We also include an example involving real data, namely the LIDAR ground and vegetation datasets from [10], and demonstrate improved performance results when MLSA features, as opposed to MLPCA or PLH features on their own, are used to distinguish between the two classes.

4.1. Data Preprocessing. The traditional approach to data pre-processing consists of scaling all the variables (features) to a common range of values (such as [0,1] or [−1,1]).
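The range-scaling convention can be sketched as follows; `minmax_scale` is our own illustrative helper, not code from the paper, and it maps constant columns to 0 to avoid division by zero.

```python
import numpy as np

def minmax_scale(X):
    """Scale each feature (column) of X into [0, 1]; a constant column
    is left at 0 rather than dividing by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
print(minmax_scale(X))  # each column now runs from 0.0 to 1.0
```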
Alternatively, for each of the features, its empirical average can be subtracted from each data point, and the result divided by the empirically calculated standard deviation: that way, each coordinate of the data will have mean zero and variance one, as is the case for the standard normal distribution N(0, 1). The justification of these pre-processing methods is numerical: the precision of computational operations is less prone to errors if numbers of similar magnitude are being used.

It is also important to consider robustness of the decision rules under conditions of data variability. Although there is no comprehensive theory for treating such issues (some data adaptation approaches are proposed in [29]), there are empirical observations (stemming from machine learning problems in rather diverse application areas, including computer networking, medical diagnostics, and computer vision) that suggest that deliberate pruning of available information can make the resulting decision rules more robust and reliable, given the inevitable variability between the distributions of training and test datasets.

Specifically, we consider discretizing, or binning, the already scaled data (with their first two moments matching N(0, 1)) along each of the available coordinates into several values that are the centers of equal-integral/equal-area segments of the standard normal density function. In this paper, we consider discretization into 10 equal-integral bins, so that the boundaries [b_i, b_{i+1}] of these bins are such that the probability of an N(0, 1)-distributed random value belonging to any of them is 1/10, i.e.,

(1/√(2π)) ∫_{b_i}^{b_{i+1}} exp(−x²/2) dx = 1/10.

This condition is realized by the following bin boundary values:

−∞, −1.2816, −0.8416, −0.5244, −0.2533, 0, +0.2533, +0.5244, +0.8416, +1.2816, +∞.

Although discretization clearly reduces the information available in the given datasets, the accuracy (error rate) of the classification decision rule constructed on the discretized dataset is usually comparable with that of the rule constructed on the original dataset. Moreover, while the error rate is about the same, the balance of sensitivity and specificity is usually more stable for the discretized dataset. One can also argue that discretization provides graceful handling of outliers: the general direction of an outlier is retained without losing the pertinent information. Finally, discretization appears to be more robust to statistical deviations between the training set and the test set (the key assumption of machine learning is that both sets have the same distribution).

4.2. Synthetic Dataset Examples.

Datasets with Crossing Points. We sampled 200 points from four different spaces, each consisting of unions of line segments within a disk of radius 0.4 centered at the origin. The four spaces are: a "+" shape formed from portions of the x and y axes; an "X" shape formed by portions of the lines y = −2x and y = 3x near the origin; a "Y" shape made up of portions of the lines y = −x and y = x above the x axis, together with part of the negative y axis; and a "triple" of line segments consisting of the plus sign together with part of the line y = x.

At each of the points in the datasets, we performed MLPCA at three different radii (0.1, 0.2, and 0.3) and extracted the eigenvalues and the components of the corresponding eigenvectors, yielding 18 MLPCA features per point. Furthermore, for each of the points at the same three radii, we performed PLH, extracted the six most persistent 0-dimensional classes, and recorded as features the persistences of these six classes.
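Before moving on: the 10-bin equal-probability boundaries listed in Section 4.1 are easy to reproduce from the inverse normal CDF. The sketch below uses Python's standard-library `NormalDist`; the helper names are ours, not the paper's.

```python
from statistics import NormalDist

def equal_probability_bins(n_bins=10):
    """Boundaries b_0 < ... < b_n of n_bins bins, each carrying
    probability 1/n_bins under N(0, 1)."""
    nd = NormalDist()
    inner = [nd.inv_cdf(i / n_bins) for i in range(1, n_bins)]
    return [float('-inf')] + inner + [float('inf')]

def discretize(x, bounds):
    """Map a standardized value x to the index of its bin (0-based)."""
    return sum(b <= x for b in bounds[1:-1])

bounds = equal_probability_bins()
print([round(b, 4) for b in bounds[1:-1]])
# -> [-1.2816, -0.8416, -0.5244, -0.2533, 0.0, 0.2533, 0.5244, 0.8416, 1.2816]
```

The printed interior boundaries match the values quoted in the text to four decimal places.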
The reason for selecting six classes is that at every point in each of the datasets, the 0-dimensional local homology groups have rank at most three for the Y, at most four for the + and the X, and at most six for the triple crossing. Thus, each of the feature vectors has length 36. In every instance, we trained a linear SVM classifier. We generated 50 examples from each category of dataset for training and 15 of each type for testing, for a total of 10,000 points for training and 3,000 for testing for each of the four categories.

To evaluate the performance of our features in the SVM classification process, we computed the maximum of the Type I (misclassify dataset A as dataset B) and Type II (misclassify dataset B as dataset A) error rates. We also recorded the sensitivity (100% minus the Type I error rate) and specificity (100% minus the Type II error rate) as additional measures of accuracy.

The results of the six pairwise data experiments are reported in Tables 1-6 in Section 6. In every case, the combination of MLPCA and PLH features from MLSA led to the lowest error rates, with PLH alone outperforming MLPCA alone in five of the six cases (the exception being the X vs. the triple crossing; see Table 5). The discretization procedure did not appear to have a strong impact on the results: binning improved MLSA accuracy rates in three of the six cases, and tied with the no-binning accuracy rates in one instance.

The lowest error rate, 0.03%, was achieved for two different pairs: the + and the triple crossing, as well as the X and the Y. For the former, the underlying + shape is a subset of the triple crossing shape, but regardless of location within each dataset, PLH detects the presence of more local homology classes (at least at larger scales) in the case of the triple crossing than in the + case.
The same is true for the case of the X and the Y, as well as for the pair consisting of the Y and the triple, which also saw an error rate under 1%. For the case of the + and the X, it is likely that the low MLSA error rates (again, under 1%) were largely due to differences in birth and death times between their respective local homology classes. The highest error rates (around 5-6%) occurred for the pairs consisting of the + and the Y, and the X and the triple crossing. For the former, recall that both point cloud types contain points on the negative y axis within a disk of radius 0.4. Since the largest radius in the MLPCA and PLH computations was 0.3, points on both the + and the Y sufficiently far from the origin should indeed be indistinguishable from one another. For the latter, the decrease in accuracy may be attributed to the fact that there are a number of points in both the X and the triple crossing point clouds at which computing PLH yields local homology groups of the same rank.

Densely Sampled Line Segments with Points on One Side vs. Both Sides. For our second synthetic dataset example, we obtained point clouds in two different ways: first, by sampling 200 points from the line segment x = 0, 0 ≤ y ≤ 1, along with 200 points from the unit square [0, 1] × [0, 1] (see Figure 5(a)); second, by sampling 200 points from the same line segment as well as a total of 200 points from the rectangle [−1, 1] × [0, 1] (100 points on either side of the line segment; see Figure 5(b)). The goal of the machine learning experiment was to distinguish points on the densely sampled line segments in the first case from points on the corresponding line segments in the second case.

In both cases, for radii 0.2 and 0.4, we computed MLPCA features (two eigenvalues and the components of the two associated eigenvectors) and PLH features (the persistence of the most persistent 1-dimensional PLH class), for a total of 14 features at each of the 200 points on the densely sampled line segments. The reasoning behind our choice of PLH features is as follows. When PLH is computed at points on the densely sampled line segment in Figure 5(b), it detects a high-persistence 1-dimensional class, whereas the 1-dimensional local homology is trivial in the case of the line segment with points on only one side. Note that the 0-dimensional PLH data should be the same in both cases.

As in our previous set of examples, we trained a linear SVM classifier with 10,000 points from each category of dataset for training and 3,000 for testing. Once again, the lowest error rate (4.07%) was achieved when both MLPCA and PLH features were utilized, with PLH features alone still vastly outperforming MLPCA features alone (7.83% vs. 33.83%); see Table 7. In this example, 10-bin discretization led to poorer accuracy rates for both MLSA features and PLH features alone than when no binning was employed. For MLPCA features alone, the results were slightly better when binning was utilized (31.4% vs. 33.83%).

Figure 5: Densely sampled line segment with points on one side only (a) vs. both sides (b).

4.3. LIDAR Image Dataset Results. In order to investigate the utility of PLH information for feature construction in real data, we studied a dataset, introduced in [10], which consists of a total of 639,520 three-dimensional points that were collected via LIDAR and labeled as either "ground" or "vegetation," with 10 subsets of each type (see Figure 6). For proof-of-concept purposes, we randomly sampled 1000 points from each of the 10 subsets in each category, for a total of 20,000 points.
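To make the feature-assembly pipeline concrete, here is a self-contained sketch in the spirit of MLSA. The persistence part is a crude stand-in, not the paper's PLH: instead of filtering the sphere S_R(z) by the distance to the thickened space, it takes the points in a thin annulus around the sphere and records the component-merge radii of a growing-ball filtration, which for a finite set are half the minimum-spanning-tree edge lengths. All helper names, and the annulus width, are our own choices.

```python
import numpy as np

def annulus_mst_persistence(X, z, R, width, m):
    """Crude stand-in for 0-dimensional local persistence at (z, R):
    points of X in a thin annulus around S_R(z) are thickened by growing
    balls; components merge at half the MST edge lengths (Prim below).
    The m largest merge radii are returned, zero-padded to length m."""
    d = np.linalg.norm(X - z, axis=1)
    P = X[np.abs(d - R) <= width]
    n = len(P)
    if n < 2:
        return [0.0] * m
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool); in_tree[0] = True
    best = dist[0].copy()                       # distance from tree to each vertex
    edges = []
    for _ in range(n - 1):                      # Prim's MST on the complete graph
        j = np.argmin(np.where(in_tree, np.inf, best))
        edges.append(best[j]); in_tree[j] = True
        best = np.minimum(best, dist[j])
    pers = sorted((e / 2 for e in edges), reverse=True)[:m]
    return (pers + [0.0] * m)[:m]

def mlsa_features(X, z, radii, m=6):
    """Concatenate, over all radii, the local PCA eigenvalues and the
    stand-in local-persistence features at z."""
    feats = []
    for R in radii:
        nbrs = X[np.linalg.norm(X - z, axis=1) <= R]
        C = nbrs - nbrs.mean(axis=0)
        feats.extend(np.sort(np.linalg.eigvalsh(C.T @ C / len(nbrs)))[::-1])
        feats.extend(annulus_mst_persistence(X, z, R, 0.05, m))
    return np.array(feats)

# Points on a vertical segment in R^2; features at its midpoint, one radius:
t = np.linspace(0.0, 1.0, 200)[:, None]
X = np.hstack([np.zeros_like(t), t])
feats = mlsa_features(X, np.array([0.0, 0.5]), radii=[0.2])
print(feats.shape)  # -> (8,)  (2 eigenvalues + 6 persistence values)
```

With more radii, the feature vector grows accordingly, mirroring the 36- and 14-dimensional vectors used in the experiments above.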
We split the data into training and testing groups in the same way as the authors of [10] and [4]; namely, we used the points in data subsets 1, 2, 4, 6, and 8 for training, and subsets 3, 5, 7, 9, and 10 for testing. At each of the two scales 2^{−3} and 2^{−4}, we computed MLPCA features (three eigenvalues and the components of three eigenvectors) and PLH features (the persistences of the most persistent 0-dimensional class and the most persistent 1-dimensional class) at each of the 1000 points. In addition, as in [4], we included the coordinates of the points in the list of features. This yielded a total of 31 features.

Figure 6: LIDAR dataset: green points correspond to "vegetation," black points correspond to "ground."

For MLPCA features alone with discretization, the maximum of the Type I (misclassify ground as vegetation) and Type II (misclassify vegetation as ground) error rates was 4.95%. When the PLH features were added to the MLPCA features, the maximum error rate was reduced to 4.31%, a relative improvement of roughly 13%. See Table 8 for details. Note that taking only PLH features led to greatly reduced levels of accuracy compared to taking either MLSA features or MLPCA features alone. This is likely due to the low number of PLH features (namely, four) compared to the size of the datasets. In all cases, discretization resulted in an improvement of accuracy rates on the order of 15-20%.

5. Discussion

The above results are promising in that multi-scale local shape analysis, via a combination of MLPCA and PLH features, consistently led to improved classification results for both synthetic and real datasets. The synthetic data experiments suggest future research to determine whether more sophisticated multi-scale local principal component features could improve detection of local dimensions in singular spaces.
Moreover, there are other methods of turning persistence diagrams into features (for example, treating the diagram as a binned image [7], or using ideas from algebraic geometry [1]) that may be more advantageous in different settings. In this context, the advantages of 10-bin discretization are less pronounced. In fact, on the synthetic datasets, the discretization process led to poorer performance, although it did improve the classification quality for the real dataset. This is to be expected, since the primary value of discretization is to make the decision rule robust when there are statistical differences between training and test data. With real-life data, such differences often arise (e.g., one collects data for training, and at testing time the data selection mechanism may be slightly different from that used for training). Thus, binning allows the decision rule to retain its robustness in such situations, as in the LIDAR case, where various patches of ground and vegetation differ from each other. If, however, there is a strong statistical match between training and test data (as typically happens with synthetic data, where the match is enforced by sampling the same distribution), then binning is useless: it throws away information for the sake of robustness, which is irrelevant in the case of a perfect match, causing performance to suffer.

We have identified several steps that we plan to undertake in future work. Among them is a more advanced version of the MLPCA approach used in this paper, which relies on features defined in terms of normalized multi-scale constructs called Jones beta numbers [20, 21, 22, 25]. Our preliminary experiments with this approach show improved classification rates and lower testing error rates in a number of cases in which these are the only features utilized.
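One plausible form of the 10-bin discretization discussed above is sketched below; the paper does not specify the binning scheme, so equal-width bins fitted per feature on the training data are an assumption here, as are the function names.

```python
import numpy as np

def fit_bins(train_col, n_bins=10):
    """Equal-width bin edges fitted on one training feature column.
    (Assumed scheme: the text only says 10-bin discretization was used.)"""
    return np.linspace(train_col.min(), train_col.max(), n_bins + 1)

def discretize(col, edges):
    """Map feature values to bin indices 0..n_bins-1. Values outside the
    training range fall into the end bins, which is the mechanism that
    makes the decision rule robust to mild train/test distribution shift."""
    return np.digitize(col, edges[1:-1])

# Example: bins fitted on training values spanning [0, 10]; a test value
# beyond the training range (12.0) is clipped into the last bin.
edges = fit_bins(np.array([0.0, 10.0]), n_bins=10)
binned = discretize(np.array([0.0, 4.5, 9.9, 12.0]), edges)
```

Because test values are reduced to coarse bin indices fitted on the training data, small distributional drift between training and test collections changes few feature values, at the cost of discarding within-bin information when the two distributions match exactly.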
We plan to continue with additional experiments and investigate what happens when the new features are combined with PLH features.

6. SUMMARY TABLES

The following tables summarize the results of the synthetic dataset experiments (Tables 1-7) and the LIDAR dataset experiment (Table 8). A value of 10 for "Bins" indicates the utilization of 10-bin discretization.

Table 1: + vs. X

Features  Bins  Sens.     Spec.     Max Errors
PLH       No     99.00%    98.63%    1.37%
MLPCA     No     87.90%    95.67%   12.10%
MLSA      No     99.37%    99.30%    0.70%
PLH       10     89.13%    94.67%   10.87%
MLPCA     10     84.23%    93.60%   15.77%
MLSA      10     98.23%    98.63%    1.77%

Table 2: + vs. Y

Features  Bins  Sens.     Spec.     Max Errors
PLH       No     89.23%    92.93%   10.77%
MLPCA     No     87.83%    94.23%   12.17%
MLSA      No     93.90%    97.50%    6.10%
PLH       10     85.87%    97.23%   14.13%
MLPCA     10     82.27%    91.33%   17.73%
MLSA      10     93.93%    98.87%    6.07%

Table 3: + vs. Triple

Features  Bins  Sens.     Spec.     Max Errors
PLH       No     99.73%    99.43%    0.57%
MLPCA     No     86.67%    96.80%   13.33%
MLSA      No     99.83%    99.93%    0.17%
PLH       10     99.97%    99.77%    0.23%
MLPCA     10     86.43%    95.27%   13.57%
MLSA      10     99.97%    99.97%    0.03%

Table 4: X vs. Y

Features  Bins  Sens.     Spec.     Max Errors
PLH       No     95.20%    98.93%    4.80%
MLPCA     No     79.17%    93.77%   20.83%
MLSA      No     99.97%   100.00%    0.03%
PLH       10     94.67%    97.77%    5.33%
MLPCA     10     81.50%    94.80%   18.50%
MLSA      10     99.97%    99.97%    0.03%

Table 5: X vs. Triple

Features  Bins  Sens.     Spec.     Max Errors
PLH       No     89.13%    92.03%   10.87%
MLPCA     No     90.00%    92.53%   10.00%
MLSA      No     94.20%    98.30%    5.80%
PLH       10     89.50%    93.90%   10.50%
MLPCA     10     92.60%    93.30%    7.40%
MLSA      10     95.00%    98.73%    5.00%

Table 6: Y vs. Triple

Features  Bins  Sens.     Spec.     Max Errors
PLH       No     98.17%   100.00%    1.83%
MLPCA     No     92.70%    91.37%    8.63%
MLSA      No     99.23%   100.00%    0.77%
PLH       10     97.47%   100.00%    2.53%
MLPCA     10     95.77%    87.57%   12.43%
MLSA      10     99.13%   100.00%    0.87%

Table 7: One Side vs. Both Sides
Features  Bins  Sens.     Spec.     Max Errors
PLH       No     92.17%    99.73%    7.83%
MLPCA     No     66.17%    86.03%   33.83%
MLSA      No     95.93%    98.73%    4.07%
PLH       10     91.67%    99.73%    8.33%
MLPCA     10     68.60%    83.97%   31.40%
MLSA      10     95.13%    98.73%    4.87%

Table 8: LIDAR

Features  Bins  Sens.     Spec.     Max Errors
PLH       No     92.80%    74.04%   25.96%
MLPCA     No     94.28%    98.86%    5.72%
MLSA      No     94.90%    98.56%    5.10%
PLH       10     92.98%    78.04%   21.96%
MLPCA     10     95.05%    99.06%    4.95%
MLSA      10     95.69%    99.14%    4.31%

REFERENCES

[1] A. Adcock, E. Carlsson, and G. Carlsson. The ring of algebraic functions on persistence bar codes. ArXiv e-prints, 2013.
[2] M. Ahmed, B. Fasy, and C. Wenk. Local persistent homology based distance between maps. In Proc. ACM SIGSPATIAL GIS, 2014. To appear.
[3] D. Bassu, R. Izmailov, A. McIntosh, L. Ness, and D. Shallcross. Application of multi-scale singular vector decomposition to vessel classification in overhead satellite imagery. Submitted.
[4] D. Bassu, R. Izmailov, A. McIntosh, L. Ness, and D. Shallcross. Centralized multi-scale singular value decomposition for feature construction in LIDAR image classification problems. In Applied Imagery Pattern Recognition Workshop (AIPR), 2012 IEEE, pages 1-6, Oct 2012.
[5] P. Bendich, D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and D. Morozov. Inferring local homology from sampled stratified spaces. In Proceedings 48th Annual IEEE Symposium on Foundations of Computer Science, pages 536-546, 2007.
[6] P. Bendich, B. Wang, and S. Mukherjee. Local homology transfer and stratification learning. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1355-1370. SIAM, 2012.
[7] P. Bendich, S. Chin, J. Clarke, J. deSena, J. Harer, E. Munch, A. Newman, D. Porter, D. Rouse, N. Strawn, and A. Watkins. Topological and statistical behavior classifiers for tracking applications. 2014.
[8] S.-T. Bow.
Pattern Recognition and Image Preprocessing. New York: Marcel Dekker, 2002.
[9] L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
[10] N. Brodu and D. Lague. 3D terrestrial LIDAR data classification of complex natural scenes using a multi-scale dimensionality criterion: Applications in geomorphology. ISPRS Journal of Photogrammetry and Remote Sensing, 68:121-134, 2012.
[11] F. Chazal, D. Cohen-Steiner, M. Glisse, L. Guibas, and S. Oudot. Proximity of persistence modules and their diagrams. In Proceedings of the 25th Annual Symposium on Computational Geometry, SCG '09, pages 237-246, New York, NY, USA, 2009. ACM.
[12] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete Comput. Geom., 37(1):103-120, January 2007.
[13] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko. Lipschitz functions have L_p-stable persistence. Found. Comput. Math., 10(2):127-139, February 2010.
[14] R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:5-30, 2006.
[15] T. K. Dey, F. Fan, and Y. Wang. Dimension detection with local homology. arXiv e-prints, May 2014.
[16] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55:78-87, 2012.
[17] H. Edelsbrunner and J. Harer. Computational Topology: An Introduction. American Mathematical Society, 2010.
[18] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
[19] P. Jones, M. Maggioni, and R. Schul. Manifold parameterizations by eigenfunctions of the Laplacian and heat kernels. Proceedings of the National Academy of Sciences, 105:1803-1808, 2008.
[20] P. W. Jones. Rectifiable sets and the traveling salesman problem. Invent. Math., 102:1-15, 1990.
[21] J.-C. Léger. Menger curvature and rectifiability. Ann. of Math., 149:831-869, 1999.
[22] G. Lerman.
Quantifying curvelike structures of measures by using L_2 Jones quantities. Communications on Pure and Applied Mathematics, 56(9):1294-1365, 2003.
[23] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall/CRC, 2008.
[24] J. R. Munkres. Elements of Algebraic Topology. Addison Wesley, 1993.
[25] H. Pajot. Analytic Capacity, Rectifiability, Menger Curvature and the Cauchy Integral. Number 1799 in Lecture Notes in Mathematics. Springer, 2002.
[26] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[27] M. Sonka, V. Hlavac, and R. Boyle. Image Processing, Analysis and Machine Vision. Toronto, Ontario: Thomson Learning, 2007.
[28] B. St. Thomas, L. Lin, L.-H. Lim, and S. Mukherjee. Learning subspaces of different dimension. arXiv e-prints, April 2014.
[29] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge, MA: Cambridge University Press, 2012.
[30] K. Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3:1415-1438, 2003.

DEPARTMENT OF MATHEMATICS, DUKE UNIVERSITY
E-mail address: bendich@math.duke.edu

DEPARTMENT OF MATHEMATICS, DUKE UNIVERSITY
E-mail address: ellen@math.duke.edu

DEPARTMENT OF MATHEMATICS, DUKE UNIVERSITY
E-mail address: harer@math.duke.edu

APPLIED COMMUNICATION SCIENCES
E-mail address: rizmailov@appcomsci.com

APPLIED COMMUNICATION SCIENCES
E-mail address: lness@appcomsci.com
