Hypothesis Testing for Topological Data Analysis

Hyp othesis T esting for T op ological Data Analysis Andrew Robinson & Katharine T urner F ebruary 23, 2016 Abstract P ersisten t homology is a vital to ol for topological data analysis. Previous w ork has dev elop ed some statistical estimators for c haracteristics of collections of p ersistence diagrams. Ho w ev er, to ols that provide statistical inference for observ ations that are persistence diagrams are limited. Sp eciﬁcally , there is a need for tests that can assess the strength of evidence against a claim that t w o samples arise from the same p opulation or pro cess. W e prop ose the use of randomization-style n ull hypothesis signiﬁcance tests (NHST) for these situations. The test is based on a loss function that comprises pairwise distances b et w een the elements of each sample and all the elemen ts in the other sample. W e use this method to analyze a range of simulated and exp erimental data. Through these examples w e exp erimen tally explore the pow er of the p -v alues. Our results sho w that the randomization-style NHST based on pairwise distances can distinguish b etw een samples from diﬀerent processes, which suggests that its use for hypothesis tests up on p ersistence diagrams is reasonable. W e demonstrate its application on a real dataset of fMRI data of patients with ADHD. 1 In tro duction T op ological Data Analysis (TD A) fo cuses up on considering shap e within data. F or example, samples may lie on a submanifold and we ma y w an t to learn ab out this manifold, or we may wan t to understand the higher dimensional correlations betw een diﬀerent v ariables. The main tool used in TDA is p ersistence homology , which summarizes how the top ology changes through a ﬁltration of a space. An imp ortan t use of TDA is as a prepro cessing to ol; getting a topological summary of eac h ob ject that ma y b e more tractable than the raw information. It may highlight geometric and top ological features that are of particular in terest. Examples of applications include analysis of the shap e of h uman ja ws (Gamble and Heo, 2010), plan t root systems (Bendich et al., 2010), shapes of calcanei b ones of v arious primates (T urner et al., 2014b) and retriev al of trademark symbols (Cerri et al., 2006). The need to introduce statistical techniques to top ological data analysis has b ecome increasingly apparen t. The observ ations used as input to most analysis techniques are generated sto chastically , often using a sampling mechanism from a pro cess or p opulation. Therefore, it is useful to talk not only ab out a single p ersistence diagram, but ab out an entire collection of them, i.e. a sample, dra wn from some distribution or pro cess. Some progress has been made in, for example, calculating means and v ariances, and applying statistical inference techniques (Mileyk o et al., 2011; T urner et al., 2014a; Balakrishnan et al., 2013; T urner, 2013; Chazal et al., 2013; Bubenik and Kim, 2007). Alternativ e approaches in v olv e reinterpreting the p ersistence diagrams as some functional summary lying in a larger Hilb ert space. In particular, there has been w ork on performing statistics with p ersistence landscapes, including randomization tests (Bub enik, 2015). Another functional summary used is the persistent homology rank function (Robins and T urner, 2015). Here, we focus on the situation in which tw o sets of samples of p ersistence diagrams hav e each been deriv ed from a pro cess, and our goal is to assess the strength of evidence against the assertion that the pro cesses are the same. Null hypothesis signiﬁcance testing (NHST) is a commonly used and imp ortan t statistical to ol that pro vides a measure of the strength of evidence against a hypothesis. In the curren t 1 setting, NHST will quan tify the the diﬀerences betw een tw o diﬀerent t yp es of underlying ob jects or pro cesses, using persistence diagrams as observ ations. F or example, NHST can pro vide a necessary condition as to whether particular p ersistence diagrams could b e used for classiﬁcation. Unfortunately , the space of persistence diagrams is geometrically v ery complicated. It is inﬁnite in dimension and arbitrarily curved (T urner et al., 2014a; T urner, 2013). As a result it is not plausible to use an y parametric mo dels for distributions, so we cannot do NHST using a method that requires an assumption of an underlying parametric mo del. Our approach is to instead ﬁnd a relev an t joint loss function and then use a randomization test (also kno wn as a p erm utation test). This metho d of NHST is standard in statistical theory and is theoretically rigorous (see, e.g., Casella and Berger, 1990; W elsh, 1996). The theory behind the randomization test ensures that when the t w o sets of diagrams are dra wn from the same single distribution of diagrams, then the p v alue obtained is a random v ariable with a uniform distribution ov er an evenly spaced subset of [0 , 1]. Ho wev er we do not know of any theory that the p v alue will necessarily b e lo w if the distributions are diﬀerent. F urthermore, since p ersistence diagrams are summary statistics, it is p ossible that the distributions of the underlying ob jects under analysis are diﬀerent but the corresponding distributions of persistence diagrams are similar. W e will therefore show b y example that there do exist situations where the n ull h yp othesis migh t b e correctly rejected by our metho d. In the following section we provide a brief ov erview of the background theory for TDA and the rationale b ehind NHST. W e then develop a test procedure in Section 3 and also outline a Monte Carlo sim ulation to estimate the corresp onding p -v alues in Section 4. In Section 5 w e apply the resulting algorithm to a range of data including point clouds of shapes and the p ersisten t homology transform of silhouette data, and the concurrence ﬁltration for fMRI data, resp ectiv ely . 2 Preliminaries 2.1 TD A Background Theory P ersistence diagrams are summaries of ho w the homology groups ev olv e ov er ﬁltered spaces. Ho- mology can be computed o v er an y ring but persistent homology requires a ﬁeld. F or computational purp oses this ﬁeld is usually Z 2 . In this pap er all homology will be computed ov er Z 2 . A k -simplex is the con v ex hull of k + 1 aﬃnely indep endent p oints v 0 , v 1 , . . . v k and is denoted [ v 0 , v 1 , . . . , v k ]. F or example, the 0-simplex [ v 0 ] is the vertex v 0 , the 1-simplex [ v 0 , v 1 ] is the edge b et ween the vertices v 0 and v 1 and the 2 simplex [ v 0 , v 1 , v 2 ] is the triangle b ordered by the edges [ v 0 , v 1 ], [ v 1 , v 2 ] and [ v 0 , v 2 ]. T echnically , there is an orientation on simplices. If τ is a p erm utation then [ v 0 , v 1 , . . . , v k ] = ( − 1) sgn( τ ) [ v τ (0) , v τ (1) , . . . , v τ ( k ) ]. Ho w ev er, if we are considering homology o v er Z 2 then 1 = − 1 and we can ignore orientation. W e call [ u 0 , u 1 , . . . u j ] a fac e of [ v 0 , v 1 , . . . v k ] if { u 0 , u 1 , . . . u j } ⊂ { v 0 , v 1 , . . . v k } . A simplicial c omplex K is a countable set of simplices such that • Every face of a simplex in K is also in K . • If t w o simplices σ 1 , σ 2 are in K then their intersection is either empty or a face of b oth σ 1 and σ 2 . Giv en a ﬁnite simplicial complex K , a simplicial k -chain is a formal linear com bination (with co eﬃcien ts in our ﬁeld of choice) of k -simplices in K . The set of k -chains forms a vector space C k ( K ). W e deﬁne the b oundary map ∂ k : C k ( K ) → C k − 1 ( K ) by setting ∂ k ([ v 0 , v 1 , . . . v k ]) = k X j =0 ( − 1) j [ v 0 , . . . ˆ v j , . . . v k ] = k X j =0 [ v 0 , . . . ˆ v j , . . . v k ] for each k -simplex and extending to k -c hains linearly . The second equality follows b ecause our co eﬃcien t group is Z 2 where − 1 = 1. Elemen ts of B k ( K ) = im ∂ k +1 are called b oundaries and elemen ts of Z k ( K ) = k er ∂ k are called cycles. Direct computation sho ws ∂ k +1 ◦ ∂ k = 0 and hence B k ( K ) ⊆ Z k ( K ). This means we can 2 Figure 1: The dimensional one p ersistence diagrams for the ﬁltrations b y the distance function for three diﬀerent lo ops. deﬁne the k th homolo gy gr oup of K to b e H k ( K ) := Z k ( K ) /B k ( K ) . A comprehensive introduction to homology can b e found in Hatcher (2002). A ﬁlter simplicial complex K = { K r | r ∈ R } is a family of coun table simplicial complexes indexed ov er the real n umbers such that each K a is a simplicial complex and K a ⊆ K b for a ≤ b . W e wish to describ e how the top ology of the ﬁltration changes as the parameter increases. F or a ≤ b we hav e an inclusion map of simplicial complexes ι : K a → K b that induces inclusion maps ι : B k ( K a ) → B k ( K b ) and ι : Z k ( K a ) → Z k ( K b ) . These inclusions induce homomorphisms (which are generally not inclusions) on the homology groups: ι a → b k : H k ( K a ) → H k ( K b ) . The image of ι a → b k consists of equiv alence classes of cycles that w ere present in K a , where the homological equiv alence is measured with resp ect to b oundaries in K b . W e then deﬁne the k th dimensional persistence diagram as a m ultiset in { ( x, y ) ∈ [ −∞ , ∞ ] 2 : x < y } such that the n um b er of p oin ts (coun ting m ultiplicity) in [ −∞ , a ] × [ b, ∞ ] is the rank of ι a → b k for all a < b . W e also include coun tably inﬁnitely many copies of the diagonal, these represen t homology classes that do not p ersist for any p ositiv e amount of time. Although w e are considering how the top ology is c hanging w e actually learn a lot ab out geomet- rical features because the ﬁltration has a quan tifying eﬀect. Figure 1 demonstrates ho w p ersistence diagrams may distinguish lo ops, even when they are top ologically equiv alent. 2.2 Distance functions on the space of p ersistence diagrams Let D denote the space of persistence diagrams. There are many c hoices of metrics in D , analogous to the v ariety of metrics on spaces of functions. W e will b e considering the distance metric that is analogous to the L 2 distance in the space of functions on a discrete space and the 2-W asserstein distance b et w een probability distributions. A natural family of metrics is discussed in T urner (2013). Let X and Y b e diagrams. W e can consider bijections φ b et ween the points in X and the points in Y . These are the transport plans that w e consider. Bijections alwa ys exist b ecause there are coun tably man y points at ev ery lo cation on the diagonal. W e only need to consider bijections where oﬀ-diagonal p oin ts are either paired with oﬀ-diagonal p oin ts or with the p oin t on the diagonal that is closest to it. 3 Deﬁne d p ( X, Y ) = inf φ : X → Y X x ∈ X k x − φ ( x ) k p p ! 1 /p (1) where k x − φ ( x ) k p p is the distance from x (respectively φ ( x )) to closest p oin t on the diagonal whenev er φ ( x ) (resp ectiv ely x ) is a cop y of the diagonal. W e will call a bijection b et ween p oin ts optimal if it ac hiev es the inﬁmum. W e can ﬁnd an optimal bijection, given tw o diagrams X and Y with only ﬁnitely many oﬀ diagonal p oin ts, using the Hungarian algorithm (also kno wn as Munkres assignmen t algorithm). Suppose X has n oﬀ- diagonal points, lab elled x 1 , x 2 , . . . x n , and Y has m oﬀ-diagonal p oin ts, lab elled y 1 , y 2 , . . . y m . Let x n +1 , x n +2 , . . . x n + m and y m +1 , y m +2 , . . . y n + m b e copies of the diagonal. W e construct a cost matrix with n + m column and rows where the ( i, j ) entry is k x i − y j k 2 2 . When either x i or y j is a cop y of a diagonal then this is the p erp endicular distance. Eac h transp ortation plan corresp onds to an assignment of ro ws to columns — a bijection b et w een the p oin ts in X and those in Y . Giv en tw o sets X and Y , and pairwise costs asso ciated to assigning to x ∈ X the ob ject y ∈ Y , the Hungarian algorithm ﬁnds the least-cost bijectiv e assignment. Supp ose we ha v e tw o diagrams X and Y eac h with only ﬁnitely man y oﬀ-diagonal points. Consider as many copies of the diagonal in X and Y to allo w the option of matching ev ery oﬀ-diagonal point with the diagonal. The cost of x ∈ X doing tas k y ∈ Y is k x − y k 2 2 . The total cost of an assignmen t (or in other words bijection) φ is P x ∈ X k x − φ ( x ) k 2 2 . The Hungarian algorithm giv es us a bijection φ that minimizes this cost. This means it gives an optimal bijection b et w een X and Y . By taking the limit as p goes to inﬁnity we get the bottleneck distance betw een tw o p ersistence diagrams X and Y ; d ∞ ( X, Y ) = inf φ : X → Y sup x ∈ X k x − φ ( x ) k ∞ . There are many wa ys to create ﬁltrations of interest. One of the most common wa ys is the forming of Rips complexes from point cloud data. W e will use this metho d for our simulated examples later. Giv en a p oin t cloud { x 1 , x 2 , . . . x N } of p oin ts in Euclidean space R n w e deﬁne the Ri ps c omplex with parameter  (denoted R (  )) to b e the ﬂag complex on the graph whose v ertices are { x 1 , x 2 , . . . x N } and contains the edge ( x i , x j ) when k x i − x j k ≤  . W e then build a ﬁltration b y considering the Rips complexes under an increasing parameter. 2.3 P ersistence Diagrams as random elemen ts If our metho d of constructing a ﬁltration is in some wa y random, including, for example, by the random selection of units from a p opulation or pro cess, then this randomness can also b e seen in the corresp onding collection of diagrams. A distribution of ﬁltrations determines a distribution of diagrams. As a result we hav e a persistence diagram v alued random element. This pro cess w orks for an y metho d of creating a ﬁltration whether it is sublevel sets of a function or the ˇ Cec h or Rips complexes from a p oin t cloud. F or example, supp ose we are sampling p oin ts m from a subset K of R d with some noise. W e are sto c hastically generating a p oint cloud that will appro ximate K . This sample generates a distribution ρ point clouds of sets of m p oin ts in R d . Eac h p oint cloud determines a ﬁltration of simplicial complexes and hence a distribution ρ ﬁltrations of ﬁltrations of simplicial complexes. In turn eac h of the ﬁltrations determines a distribution ρ K of persistence diagrams. Ev ery time w e draw m sample p oin ts to create a p oin t cloud we are eﬀectively drawing a sample from the distribution ρ point clouds and hence also dra wing a sample p ersistence diagram from ρ K . Under certain conditions, for example that the sample is random, w e can learn something ab out K b y analyzing ρ K . Supp ose w e hav e another subset L of R d whic h we can similarly sample to form p oin t clouds. W e may wish to know if K and L are diﬀeren t. Our null hypothesis w ould b e that they are the same subset. A necessary , but not suﬃcient, criterion for K to b e L is that ρ K = ρ L . This implies that our n ull hypothesis for studying persistence diagrams is that the underlying distributions from which ρ K and ρ L are drawn are the same. W e later consider a sim ulated examples of this form. 4 2.4 Null Hyp othesis Signiﬁcance T esting W e no w review the algorithm and rationale b ehind n ull h yp othesis signiﬁcance tests. F urther reading can b e found in many introductory and medium-level statistical texts; we mention for example, Casella and Berger (1990), W elsh (1996) and P a witan (2001). The steps for the test are as follows. First, choose a parameter that represents the data in some wa y , and ab out which a p ertinen t h yp othesis can b e formed. P opular examples of parameters include the sample mean and median for tests of location, and the v ariance for tests of spread. Second, choose a statistic to use to estimate the parameter. Third, predict the statistical b ehavior of the statistic under the null h ypothesis, trying to capture the full range of v ariability that is implied by the mo del. Commonly , statistical theory is used to nominate a distribution for the test statistic assuming that the n ull hypothesis is true. F or example, the sample mean migh t b e assumed to hav e a Gaussian distribution, based on the Central Limit Theorem or the assumed distribution of the data. F ourth, compare the observed v alue of the test statistic with the exp ected behavior under the null hypothesis. If the test statistic is anomalous compared with the exp ected b ehavior under the null hypothesis, then it is considered to b e evidence against the null hypothesis. F ormal approac hes to testing diverge at the p oin t of comparison, and w e discuss t wo of them here. One approac h requires nominating a cutoﬀ, called the size of the test, which is by deﬁnition the probability of rejecting the null h yp othesis. That is, the cutoﬀ is set to b e the probability of rejecting the null h yp othesis when it is true. Then, the probabilit y of observing a result as or more extreme than the observ ed test statistic is computed based on repeated experiments, assuming that the null h ypothesis is true. That is, we imagine a set of identical exp erimen ts to b e carried out, for whic h the null h yp othesis is true, and ask what is the proportion of that set for which the computed test statistic is more extreme than the observed v alue, relative to the null h yp othesis. Then we rep ort the outcome of the comparison of the observed probability against the size of the test. This is a Neyman–Pearson approach to testing, and in the classical case where the v ariance is unknown and the hypothesis concerns the mean, the reference distribution is Student’s t distribution, with degrees of freedom equal to the sample size n minus one. Another approac h, following Fisher, simply reports the estimated probability computed ab o ve, called the p v alue. The reader is free to place their own interpretation on the p v alue. W e will follow the latter approac h. In an y case, interpretation of the outcome of the usual NHST is conditional on some mo del, and the hypothesis is stated in terms of parameters of the mo del. It is due diligence for the analyst to ensure that the mo del is a defensible appro ximation to realit y . This is usually p erformed b y examining graphical diagnostics of some quantities that arise from the mo del estimation. F or example, the analyst migh t create histograms of the residuals, which are the diﬀerences b et w een the observ ations and the v alues that w ould ha v e b een observ ed had the data follo w ed the assumed mo del exactly . The p -value is deﬁned as the probability under rep eated equiv alent exp erimentation that a result as or more extreme (relative to the null) would be observed, conditional on the null hypothesis b eing true. T rue p -v alues are nev er zero (with probability one). W e can use p -v alues to p erform n ull hypothesis testing by pic king a threshold α and rejecting the null h ypothesis when the p -v alue is b elow α . 2.5 P o wer of a T est The pow er of a null hypothesis test describ es the num ber of type I and and t yp e I I errors in terms of the comparison of the p v alue with the threshold α . A more p ow erful test is b etter at rejecting the n ull h yp othesis when it indeed is false. The size of the test is the p o w er ev aluated at the null h yp othesis. F or H 0 the n ull h ypothesis, and H 1 the alternativ e w e deﬁne the p ower of the NHST b y p o wer = P  reject H 0   H 1 is true  . Ev en though we are adopting the Fisherian approach to NHST, the p o w er concept is useful as it provides a means of comparing betw een test statistics. In general, so long as the test is un biased (i.e. the size is correctly ac hiev ed at the n ull h yp othesis), w e will prefer to use the test that is uniformly most p o w erful (UMP), or at least lo cally most p o w erful (LMP) in the region of the n ull h yp othesis. In our case w e use a permutation distribution to establish the behaviour of 5 the test statistic under the null hypothesis, so formal considerations suc h as UMP and LMP are not p ossible. Nonetheless, we can appro ximately assess diﬀerent test statistics by using simulation exp erimen ts, and assessing the prop ortion of times that h ypotheses are rejected at a nominal level based on speciﬁed diﬀerences. W e can compare these proportions directly betw een diﬀerent test statistics to provide an idea of the relative merits of the test statistics. 2.6 Randomization T ests Randomization-based tests reliev e the analyst of the need to nominate a formal mo del under the n ull h yp othesis, b y providing an empirical estimate of the distribution of the test statistic under the null hypothesis. That is, instead of nominating a theoretical distribution to use as a basis for comparison with the test statistic, an empirical n ull distribution is created, using simulation. The pro cedure outlined in the follo wing section is a randomization test. W elsh (1996) pro vides a readable introduction. 3 A T est Pro cedure 3.1 Tw o Sets of Lab els (t-test) Assume that we ha v e a collection of n indep endent p ersistence diagrams and a tentativ e lab eling sc heme that divides the collection in to tw o p ossibly dissimilar collections, say X 1 con taining n 1 diagrams and X 2 con taining n 2 diagrams. F or example, w e may conjecture that the p ersistence di- agrams that represen t fMRI data of t w o groups of patien ts – one group with a condition of in terest, and one without — are dissimilar. The assumption of indep endence precludes the p ossibilit y that an y of the observ ations ma y ha ve an inﬂuence on an y of the other observ ations and is important for generating the null distribution. Our goal is to assess the strength of evidence that the pro cesses that generated the collections X 1 and X 2 diﬀer. W e realize this goal in the NHST framework as follows. W e take as the null hypothesis the claim that the lab els are exchangeable; that is, informally , that the current conﬁguration of lab els is no less lik ely than w ould ha v e happened under a random labeling sc heme, relative to the test statistic. An example of this reasoning follows. Giv en three tosses of a fair coin, each p ossible conﬁguration has an iden tical probability — 0 . 125. Ho w ev er, from the p oin t of view of coun ting the n um b er of heads in three tosses, as a test statistic, it is m uc h less lik ely that the coun t will b e three (for a fair coin, 0 . 125) than t w o (for a fair coin, 0 . 375). The same reasoning holds in the prop osed test: even though each p ossible conﬁguration of the label is equally p ossible under the n ull hypothesis, w e conjecture that very many of the random conﬁgurations lead to a v alue of the test statistic that is quite diﬀerent to that in the observed sample. 3.2 T est statistics Randomization tests that are used to compare tw o n umerical samples usually fo cus on some func- tion of the distance of the means of the samples. In the current study , computing the means is exp ensiv e, therefore computing the distance from eac h observ ation to the means for eac h simulated set of lab els will also b e exp ensive. W e therefore instead nominate a function of the within-group pairwise distances as a test statistic. This statistic needs to b e computed only once for each p ossi- ble pair, and b e stored in a table. Then simulation can pro ceed b y summing the distances of pairs of observ ations that are randomly allocated to the same group. When the observ ations are on the real line, and the measure of lo cation is obtained b y min- imising the L2 norm, the lo cation estimate is the mean, and the L2 norm is a monotonic function of the v ariance. In the proposed setup w e hav e tw o putative means and tw o putative v ariances to consider. The join t loss of an y lab elling scheme, conditional on the sample sizes, can b e expressed as the sum of the group-wise v ariances. Hence we prop ose that taking the mean or the sum of the v ariances of the t w o groups would b e a sensible test statistic. The usual expression for the sample v ariance (for sets of real num b ers), which is in the form closest to the L2 norm ev aluated at its minim um, is 6 σ 2 X = 1 n − 1 n X i =1 ( x i − ¯ x ) 2 (2) ho w ev er, an equiv alen t v ariation can b e computed without ﬁrst calculating the mean, namely σ 2 X = 1 2 n ( n − 1) n X i =1 n X j =1 ( x i − x j ) 2 (3) One adv antage of using (3) instead of (2) is that the matrix of pairwise distances only has to b e computed once, and the means of the randomly generated samples are not calculated. Randomly sh uﬄing the group lab els amounts to reading diﬀerent sets of cells from the precalculated distance matrix. F or persistence diagrams with lab eling L into the sets X 1 = { X 1 , 1 , X 1 , 2 , . . . , X 1 ,n 1 } and X 2 = { X 2 , 1 , X 2 , 2 , . . . X 2 ,n 2 } the analogous test statistic is σ 2 X 12 ( L ) = 2 X m =1 1 2 n m ( n m − 1) n m X i =1 n m X j =1 d 2 ( X m,i , X m,j ) 2 (4) where d 2 ( · , · ) is the distance function in (1). The distance b et w een means is not a suitable test statistic for sets of p ersistence diagrams. There is a high computational cost of computing the means for each p erm utation. F urthermore, the F r ´ ec het mean is not necessarily unique which then leads to issues of how to deﬁne this loss function when it is not. 3.3 F amilies of joint loss functions The te st statistic σ 2 X 12 is not the only option for NHST permutation tests. It is an example of a join t loss function. A loss function is a function that maps data onto a real num ber intuitiv ely represen ting some “cost” asso ciated with the mo del and the data. A join t loss function is the sum of tw o or more loss functions corresp onding to the total cost. Just as there are many metrics there are many diﬀerent join t loss functions. W e can use other metrics to construct joint loss functions. Sensible choices include F p,q ( { X 1 ,i } , { X 2 ,i } ) := 2 X m =1 1 2 n m ( n m − 1) n m X i =1 n m X j =1 d p ( X m,i , X m,j ) q where p ∈ [1 , ∞ ], d p ( · , · ) is the distance function in (1) and q ∈ [1 , ∞ ). If the grouping is sensible then these joint loss functions should b e small. In this pap er we will restrict our attention to p = 1 , 2 , ∞ and q = 1 , 2. 4 Mon te-Carlo sim ulation of the p -v alue W e can deﬁne a true p -v alue b y considering how extreme the test statistic is on the observed data compared to the test statistics for every p ossible p erm utation of the lab els. Lemma 1. L et X 1 , 1 , X 1 , 2 , . . . , X 1 ,n 1 and X 2 , 1 , X 2 , 2 , . . . X 2 ,n 2 b e p ersistenc e diagr ams dr awn i.i.d. (the nul l hyp othesis) and let α b e the pr op ortion of al l lab elings L such that T ( L ) ≤ T ( L observe d ) . Then for al l p ∈ [0 , 1] we have P ( α ≤ p ) ≤ p . Pr o of. Consider the list of all possible diﬀerent sets of labels of the set { X 1 , 1 , X 1 , 2 , . . . , X 1 ,n 1 , X 2 , 1 , X 2 , 2 , . . . X 2 ,n 2 } , alongside the observed labels. Order these lab elings b y the cost, lo west ﬁrst, randomly arranging amongst ties. Let the random v ariable W b e the n um ber of diﬀeren t lab els app earing b efore the observ ed lab els. Since under the mo del, the p ersistence diagrams are i.i.d., there is a uniform 7 probabilit y of the location of the original lab eling ov er all the rankings. That is, W has a uniform probabilit y o ver the natural n um bers from 0 to N . This implies that P ( W ≤ k ) = k for all k ∈ { 0 , 1 , . . . N } . F urthermore P ( W / N ≤ p ) ≤ p for all p ∈ [0 , 1]. Note that there is a coupling b et w een W / N and Z with W/ N ≤ Z . They agree except p oten tially in the case where there are m ultiple lab elings with the same cost as the originally observ ed. Using this coupling we conclude that for all p ∈ [0 , 1] P ( Z ≤ p ) ≤ P ( W / N ≤ p ) ≤ p. The p erm utation p -v alue for a giv en lab eling L 0 under the cost function F is deﬁned to be the prop ortion of lab eling L suc h that F ( L ) ≤ F ( L 0 ). This is a true p -v alue b y Lemma 1. The total num ber of p ermutations of the lab els is  n 1 + n 2 n 1  , which is usually far to o large to c hec k exhaustiv ely . W e instead sample from the set of p erm utations with uniform probability . This will pro vide an un biased estimator of the true permutation p -v alue. The randomization NHST algorithm is presented as Algorithm 1. Algorithm 1: An unbiased estimator of the p erm utation p -v alue. Data : n 1 + n 2 p ersistence diagrams with lab els L observed in disjoint sets of size n 1 and n 2 , n um b er of rep etitions N , a joint loss function F Result : estimate of the total p erm utation p v alue initialization - Z=0; Compute F ( L observed ) for the observed lab els; for i = 1 to N − 1 do Randomly shuﬄe the group lab els into disjoint sets of size n 1 and n 2 to give lab eling L ; Compute F ( L ) for the new samples; if F ( L ) ≤ F ( L observe d ) then Z += 1 end end Z / = N ; Output Z The output of Algorithm 1 is an unbiased estimator of the p erm utation p -v alue but is not itself a p -v alue. This ma y at ﬁrst app ear coun terin tuitiv e. Ho w ev er, observe that the output of our estimator could be zero but a p erm utation p -v alue should never b e zero. It m ust be at least 1 /  n 1 + n 2 n 1  as F ( L 0 ) ≤ F ( L 0 ). F or more details as to wh y Z is not a p -v alue see Phipson and Sm yth (2010). Ho w ev er, we can compute use the output of Z to compute a true p -v alue. Let p total b e the p erm utation p -v alue computed using all the p erm utations. The output Z of Algorithm 1 follo ws the distribution Z ∼ B ( n, p true ) n . Theorem 1 (Phipson and Smyth (2010)) . L et Z b e the output of A lgorithm 1. Z + 1 N + 1 is a true p -value. The pro of uses the facts that the usual distribution of the p -v alue under the null h yp othesis is uniform and that Z ∼ B ( n,p true ) n . F or each randomly chosen p erm utation of lab els L there is a p total c hance that T ( L ) ≤ T ( L observed ) and each of the lab el permutations is drawn indep enden tly . By mo difying Algorithm 1 we hav e obtain an algorithm for a true p -v alue. 8 Algorithm 2: p -v alue algorithm for persistence diagrams. Data : n 1 + n 2 p ersistence diagrams with lab els L observed in disjoint sets of size n 1 and n 2 , n um b er of rep etitions N , a joint loss function F Result : p v alue initialization - Z=1; Compute F ( L observed ) for the observed lab els; for i = 1 to N − 1 do Randomly shuﬄe the group lab els into disjoint sets of size n 1 and n 2 to give lab eling L ; Compute F ( L ) for the new samples; if F ( L ) ≤ F ( L observe d ) then Z += 1 end end Z / = ( N + 1); Output Z 5 Examples 5.1 P oin t clouds of nearb y shap es No w let K b e the circle of radius 1 as sho wn in Figure 2. Let M b e t wo concen tric circles with radius 0 . 9 and 1 . 1 as sho wn in Figure 3. W e hav e the Hausdorﬀ distance betw een K and M is 0 . 1. Figure 2: K Figure 3: M Let ρ K ( σ ) denote the distribution of the conv olution of the uniform distribution on K with Gaussian noise N (0 , σ 2 ). Similarly let ρ M ( σ ) denote the distribution of the conv olution of the uniform distribution on M with Gaussian noise N (0 , σ 2 ). W e wan t to compare sets of p ersistence diagrams constructed from p oin t clouds drawn from ρ K ( σ ) to those constructed from point clouds dra wn from ρ M ( σ ) for a range of σ . W e will now describ e the sim ulation pro cedure. First we ﬁxed a σ ∈ { 0 , 0 . 01 , 0 . 02 , 0 . 03 , 0 . 04 , 0 . 05 } . Then w e constructed 100 sets of 10 p oin t clouds with each of the 10 p oin t clouds consisting of 50 i.i.d. p oin ts dra wn from µ K ( σ ). Similarly , we created 100 sets of 10 p oin t clouds drawn using µ M ( σ ). W e then constructed the persistence diagrams in dimension 1 for eac h of these 2000 p oint clouds via a Rips ﬁltration (as describ ed in 2.2) and lab elled eac h with either K or M dep ending on whether the p oin t cloud was drawn using µ K ( σ ) or µ M ( σ ). These persistence diagrams w ere eﬀectiv ely summaries of the c hanges in the space of non-con tractible loops as we considered the union of balls around eac h of the p oin ts in the point cloud as w e increased the radius. W e then to ok 10 p ersistence diagrams lab elled with K and 10 p ersistence diagrams lab elled with M . W e then computed the corresp onding Z from Algorithm 2, repeating the algorithm with the diﬀeren t joint loss functions F (1 , 1) , F (1 , 2) , F (2 , 1) , F (2 , 2) , F ( ∞ , 1) and F ( ∞ , 2) . Here, for each random lab elling L , the joint loss functions are F ( p,q ) ( L = ( { X 1 ,i } , { X 2 ,i } )) = 1 2 · 10 · 9   10 X i,j =1 d p ( X 1 ,i , X 1 ,j ) q + 10 X i,j =1 d p ( X 2 ,i , X 2 ,j ) q .   9 Using the set of 100 v alues of Z for the diﬀerent simulations w e estimated the p o w er of the p erm utation test under these diﬀerent join t loss functions and cutoﬀs of α = 0 . 005 , 0 . 01 , 0 . 05 and 0 . 1. F or a ﬁxed σ , joint loss function and α we counted the n um ber of corresp onding Z suc h that Z < α W e ran this simulation for σ = 0 , 0 . 01 , 0 . 02 , 0 . 03 , 0 . 04 and 0 . 05. The results are tabulated in Figure 4. Each ﬁgure corresp onds to a diﬀerent loss function and the diﬀerent colours corresp ond to diﬀerent levels of α . As w ould b e expected the ability to distinguish the tw o sets of diagrams increases as the sets are further apart and as the num b er of p oin ts in the p oin t cloud increases. An imp ortan t observ ation is that the p o w er v aries amongst the diﬀerent loss functions. In particular the loss functions using the b ottlenec k distance ( F ( ∞ , 1) and F ( ∞ , 2) ) do muc h worse. 0 0 . 9 0 . 8 0 . 7 0 . 6 0 . 5 1 0 . 1 0 . 2 0 . 3 0 . 4 0 0 . 01 0 . 02 0 . 03 0 . 04 0 . 05 (a) F (1 , 1) 0 0 . 9 0 . 8 0 . 7 0 . 6 0 . 5 1 0 . 1 0 . 2 0 . 3 0 . 4 0 0 . 01 0 . 02 0 . 03 0 . 04 0 . 05 (b) F (1 , 2) 0 0 . 9 0 . 8 0 . 7 0 . 6 0 . 5 1 0 . 1 0 . 2 0 . 3 0 . 4 0 0 . 01 0 . 02 0 . 03 0 . 04 0 . 05 (c) F (2 , 1) 0 0 . 9 0 . 8 0 . 7 0 . 6 0 . 5 1 0 . 1 0 . 2 0 . 3 0 . 4 0 0 . 01 0 . 02 0 . 03 0 . 04 0 . 05 (d) F (2 , 2) 0 0 . 9 0 . 8 0 . 7 0 . 6 0 . 5 1 0 . 1 0 . 2 0 . 3 0 . 4 0 0 . 01 0 . 02 0 . 03 0 . 04 0 . 05 (e) F ( ∞ , 1) 0 0 . 9 0 . 8 0 . 7 0 . 6 0 . 5 1 0 . 1 0 . 2 0 . 3 0 . 4 0 0 . 01 0 . 02 0 . 03 0 . 04 0 . 05 (f ) F ( ∞ , 2) Figure 4: The p o w er of the p erm utation NHST with cut-oﬀs α = 0 . 001(red), 0 . 005 (orange), 0 . 01 (green), 0 . 05 (blue), and 0 . 1 (purple). The x co ordinate indicates the noise σ and the y -co ordinate the simulated p o w er. 5.2 Distinguishing sets of shap es from silhouette databank In this example we will be using a v ariation on the the theme of a p ersistence diagram as a random elemen t. Given a simplicial complex M in Euclidean space and a unit vector v we can create a ﬁltration of M by the height function h v in the direction of v and hence we can construct a p ersistence diagram X ( K, v ) from the ﬁltration of M b y sublev el sets of h v . The p ersisten t homology transform of M is the function from the sphere of directions to the space of p ersistence diagrams where v is sent to X ( M , v ). This pro cess is explored in detail in T urner et al. (2014b). There it is sho wn that the p ersistent homology transform of a shap e is a suﬃcien t statistic and is 10 stable under p erturbations of the shap e. As such it is reasonable, given sets of shap es, to analyze the sets of their p ersisten t homology transforms. The distance squared betw een the persistent homology transforms of tw o shap es is eﬀectively the integral ov er the unit sphere of the distances squared b et w een the corresp onding diagrams. This pro cess can b e made scale and translation in v ariant b y appropriately mo difying the diagrams X ( M , v ). F urthermore it can b e made rotation inv ariant by taking the inﬁmum of all p ossible rotations. The Lp distance b et w een tw o aligned ob jects in the plane, M 1 and M 2 , in the plane is d p ( M 1 , M 2 ) :=  Z S 1 d p ( X ( M 1 , v ) , X ( M 2 , v )) p dv  1 /p where X ( M i , v ) is the 0 th dimensional persistence diagram corresponding to the height function h v : M i → R , x 7→ x · v . The distance of unaligned ob jects is the minimal distance under diﬀerent rotations. F or more details the reader is referred to T urner et al. (2014b). 1 A shap e database that has b een commonly used in image retriev al is the MPEG-7 shap e silhouette database Sik ora (2001). W e used a subset of this database Latec ki et al. (2000) whic h includes seven class of ob jects: Bone, Heart, Glass, F oun tain, Key , F ork, and Axe. There were t w en t y examples for each class for a total of 1400 shap es. The shap es are display ed in Figure 5. Figure 5: The subset of the silhouette database. Each ro w corresp onds to one of the ob jects: Bone, Heart, Glass, F ountain, Key , F ork, and Axe. Note that although the ob jects are distinct, there is a great deal of v ariation within each ob ject. W e used the perimeters of the silhouettes whic h are a v ailable at Gao (2004). W e applied the alignmen t algorithm we stated in Section 3.3 of T urner et al. (2014b) to shift, scale, and rotate the silhouettes. These p erimeters are all homotopic to a circle so we used the 0-th dimensional p ersisten t homology transform with 64 evenly spaced directions. W e considered both the loss functions F 1 ( { X 1 ,i } , { X 2 ,i } ) := 2 X m =1 1 2 n m ( n m − 1) n m X i =1 n m X j =1 d 1 ( X m,i , X m,j ) and F 2 (( { X 1 ,i } , { X 2 ,i } ) := 2 X m =1 1 2 n m ( n m − 1) n m X i =1 n m X j =1 d 2 ( X m,i , X m,j ) 2 . F or each pair of classes w e then computed the corresp onding p v alues using 10000 rep etitions. The algorithm for eac h diﬀerent pair alwa ys resulted in p < 0 . 0001, using either the F (1 , 1) or the F (2 , 2) loss function. This implies that we exp ect that the distributions of p ersisten t homology transforms are very signiﬁcantly diﬀerent. This result implies that NHST-based classiﬁcation via the p ersistent homology transform should b e p ossible. 5.3 Concurrence T op ology in fMRI data Giv en a set of v ariables and and samples of dichotomized data across those v ariables, concurrence top ology is a metho d of creating ﬁltration of a simplicial complex (and hence also p ersistence 1 The reader should note that in T urner et al. (2014b) the fo cus is on the L1 distance. 11 diagrams) to reﬂect the frequency of when subsets of the v ariables are simultaneously active. This metho d is studied in Ellis and Klein (2012) and applied to fMRI data for b oth sub jects diagnosed with ADHD and healthy controls. W e brieﬂy describ e the pro cedure and refer the reader to Ellis and Klein (2012) for details. Some set lo cations in the brain w ere measured. F or each time in terv al w e get a num ber as- so ciated to how activ e that location in the brain is. These data are dic hotomized by choosing a cutoﬀ v alue. F or each lo cation in the brain w e associate a v ertex v i . W e assign to the vertex v i the v alue of num ber of times that lo cation was active. W e assign to the edge [ v i , v j ] the num ber of times b oth v i and v j w ere active sim ultaneously . Similarly assign to the face [ v i , v j , v k ] the num ber of times all three of v i , v j and v k w ere activ e sim ultaneously . The same pro cess assigns v alues to all simplices in the complete simplicial complex. The ﬁltration is by sup erlev el sets. This is a simpliﬁcation of the pro cedure. As part of the cleaning process some of the lo cations of in the brain are ignored and this set is diﬀerent dep ending on the sub ject. W e calculated the p v alues asso ciated with the sets of p ersistence diagrams in the “default mo de net w ork” that Ellis and Klein computed and kindly provided. In red are the p v alues which T able 1: Output of the Algorithm 2, using loss function F 2 , 2 , with 10000 rep etitions Dimension 0 1 2 3 4 5 ADHD vs Control 0.75272 0.20537 0.50679 0.41815 0.10146 0.01162 ADHD vs Control in F emales 0.68016 0.59175 0.77673 0.90267 0.00588 0.30057 ADHD vs Control in Males 0.46101 0.22070 0.59409 0.48437 0.41364 0.00975 F emales vs Males in Control 0.00930 0.59964 0.33578 0.09851 0.19303 0.26304 F emales vs Males in ADHD 0.48694 0.45473 0.60937 0.59045 0.02443 0.83618 are ≤ 0 . 01. If w e take a signiﬁcance cutoﬀ at p = 0 . 01 then our exp ected false disco v ery rate is m uc h less than 1. A few commen ts should b e made ab out the data set. The fMRI data set was generated at New Y ork Universit y and distributed as part of the 1000 F unctional Connectomes pro ject ( http://fcon1000.projects.nitrc.org/ ). It includes 41 healthy controls (NewY ork a part1) and 25 adults diagnosed with ADHD (NewY ork a ADHD). Unfortunately the samples were highly im balanced with respect to age and gender. Only 20% of the ADHD group w as female, while ab out half of the co n trols w ere. About 25% of the controls w ere children (younger than 20; median age = 12), while there were no children in the ADHD group. Among adults, ages ranged from ab out 21 to ab out 50 in eac h group. The median age in the ADHD group w as 37, while in the control group the median adult age was 27. W e did not control for age while computing the p v alues, and it is not presently clear ho w that could b e done in this context. 6 Discussion and F uture Directions This paper sho ws an example of ho w non-parametric metho ds from statistics can b e adapted for the use with topological summary statistics. There is muc h potential in exploring other non-parametric metho ds. There are a v ariet y of related problems for hypothesis testing. One problem is whether it is p ossible to do alternate h ypothesis testing when the observ ations are p ersistence diagrams. Another area to explore is determining under what circumstances w e can guaran tee the p ow er of the n ull h yp othesis testing pro cedure to distinguish diﬀerent distributions of p ersistence diagrams. This exploration may b e through theoretical results or via simulations. 6.1 More than Tw o Lab el Sets When the goal is to test whether k > 2 groups of observ ations diﬀer, we can use an extension of the test statistic in equation 4 that is analogous to the F statistic in analysis of v ariance. 12 σ 2 x k = k X m =1 1 2 n m ( n m − 1) n m X i =1 n m X j =1 ( x mi − x mj ) 2 (5) This is not the same as the F statistic b ecause we do not propose to compute the b etw een-groups sums of squares, as that is exp ensiv e in this setting. 7 Ac knowledgemen ts W e thank Steve Ellis and Arno Klein for providing us with the p ersistence diagrams pro duced in their work. The authors w ould like to ackno wledge the assistance of the Defence Science Institute in facilitating this work. References Balakrishnan, S., F asy , B., Lecci, F., Rinaldo, A., Singh, A., and W asserman, L. (2013). Statistical inference for p ersisten t homology . arXiv pr eprint arXiv:1303.7117 . Bendic h, P ., Edelsbrunner, H., and Kerb er, M. (2010). Computing robustness and p ersistence for images. Visualization and Computer Gr aphics, IEEE T r ansactions on , 16(6):1251–1260. Bub enik, P . (2015). Statistical topological data analysis using p ersistence landscap es. The Journal of Machine L e arning R ese ar ch , 16(1):77–102. Bub enik, P . and Kim, P . T. (2007). A statistical approach to p ersisten t homology . Homolo gy, Homotopy and Applic ations , 9(2):337–362. Casella, G. and Berger, R. L. (1990). Statistic al Infer enc e . Duxbury Press, Belmont, CA. Cerri, A., F erri, M., and Giorgi, D. (2006). Retriev al of trademark images b y means of size functions. Gr aphic al Mo dels , 68(5):451–471. Chazal, F., Glisse, M., Labru` ere, C., and Michel, B. (2013). Optimal rates of conv ergence for p ersistence diagrams in top ological data analysis. arXiv pr eprint arXiv:1305.6239 . Ellis, S. P . and Klein, A. (2012). Describing high-order statistical dep endence using” concurrence top ology”, with application to functional mri brain data. arXiv pr eprint arXiv:1212.1642 . Gam ble, J. and Heo, G. (2010). Exploring uses of p ersistent homology for statistical analysis of landmark-based shap e data. Journal of Multivariate Analysis , 101(9):2184–2199. Gao, J. X. (2004). Visionlab. WWW. http://visionlab.uta.edu/shape_data.htm . Hatc her, A. (2002). Algebr aic top olo gy . . Latec ki, L. J., Lak amper, R., and Eckhardt, T. (2000). Shap e descriptors for non-rigid shap es with a single closed contour. In Computer Vision and Pattern R e c o gnition, 2000. Pr o c e e dings. IEEE Confer enc e on , v olume 1, pages 424–429. IEEE. Mileyk o, Y., Mukherjee, S., and Harer, J. (2011). Probabilit y measures on the space of persistence diagrams. Inverse Pr oblems , 27(12):124007. P a witan, Y. (2001). In Al l Likeliho o d: Statistic al Mo del ling and Infer enc e Using Likeliho o d . Claren- don Press, Oxford. Phipson, B. and Smyth, G. K. (2010). Perm utation p-v alues should never b e zero: calculating exact p-v alues when p erm utations are randomly drawn. Statistic al applic ations in genetics and mole cular biolo gy , 9(1). 13 Robins, V. and T urner, K. (2015). Principal comp onen t analysis of p ersisten t homology rank functions with case studies of spatial point patterns, sphere packing and colloids. arXiv pr eprint arXiv:1507.01454 . Sik ora, T. (2001). The mp eg-7 visual standard for conten t description—an o v erview. Cir cuits and Systems for Vide o T e chnolo gy, IEEE T r ansactions on , 11(6):696–702. T urner, K. (2013). Means and medians of sets of p ersistence diagrams. arXiv pr eprint arXiv:1307.8300 . T urner, K., Mileyko, Y., Mukherjee, S., and Harer, J. (2014a). F r´ ec het means for distributions of p ersistence diagrams. Discr ete & Computational Ge ometry , 52(1):44–70. T urner, K., Mukherjee, S., and Boy er, D. M. (2014b). Persisten t homology transform for mo deling shap es and surfaces. Information and Infer enc e , 3(4):310–344. W elsh, A. H. (1996). Asp e cts of Statistic al Infer enc e . John Wiley & Sons, Inc., New Y ork, NY. 14

Hypothesis Testing for Topological Data Analysis

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment