hyppo: A Multivariate Hypothesis Testing Python Package

Sambit Panda¹, Satish Palaniappan¹, Junhao Xiong¹, Eric W. Bridgeford², Ronak Mehta¹, Cencheng Shen³, and Joshua T. Vogelstein¹,⁴*

Abstract. We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state-of-the-art multivariate testing procedures. The package is easy to use and is flexible enough to enable future extensions. The documentation and all releases are available at https://hyppo.neurodata.io.

Keywords. Python, multivariate, independence, k-sample, hypothesis

1 Introduction

Examining and identifying relationships between sets of high-dimensional variables is critical to advance understanding and planning of future numerical and physical experiments. Hypothesis testing enables formally testing models to identify such discrepancies. Many correlation measures have been proposed to solve the problem of independence testing, such as Pearson's correlation [1], but many are unsuited to detect nonlinear and high-dimensional dependence structures within data. Recently, several statistics have been proposed that operate well on high-dimensional (potentially non-Euclidean) data, such as distance correlation [2–5] and the Hilbert–Schmidt independence criterion [6–8], which were shown to be exactly equivalent by Sejdinovic et al. [9] and Shen and Vogelstein [10]. Heller, Heller, and Gorfine proposed another nonparametric independence test with particularly high power in certain nonlinear relationships [11].
Multiscale graph correlation, which combines and extends nearest-neighbor and energy statistics to detect underlying relationships, has demonstrated higher statistical power on many multivariate, nonlinear, and structured data settings when compared to other independence tests [12–14]. For each of these tests, p-values can be calculated using a random permutation test [15–17]. These tests can also be modified and extended to applications such as time-series testing [18]. To approach the problem of two-sample testing, Student's t-test [19] is traditionally used, while a few nonparametric alternatives have been proposed that operate well on multivariate, nonlinear data, such as Energy [20], maximum mean discrepancy [21], and Heller, Heller, and Gorfine's test [11]. The two-sample testing problem can be generalized to the k-sample testing problem, where analysis of variance (ANOVA) [22] or its multivariate analogue, multivariate ANOVA (MANOVA) [23], can be used; these statistics, however, either fail on, or operate poorly upon, non-Gaussian data [24, 25]. There are a few nonparametric alternatives to ANOVA and MANOVA, such as the multivariate k-sample Heller–Heller–Gorfine test [26] and distance components (DISCO) [27]. Recently, Panda et al. [28] showed that nonparametric distance and kernel k-sample tests can be formulated by reducing the k-sample testing problem to the independence testing problem. This manuscript introduces hyppo, a hypothesis testing package that provides various tests with high finite-sample statistical power on multivariate and nonlinear relationships. hyppo is a well-tested, multi-platform, Python 3 compatible library that allows users to conduct hypothesis tests on their data, and is also extensible enough to allow developers to easily add their own tests.
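The random permutation test referenced above can be sketched concretely: hold one sample fixed, repeatedly shuffle the other, and compare the observed statistic to the resulting permutation null. The following is an illustrative, self-contained re-implementation (not hyppo's code), using the biased squared distance correlation as the statistic:

```python
import numpy as np

def _center(D):
    # Double-center a pairwise distance matrix (biased Dcorr centering).
    return D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()

def dcorr_stat(x, y):
    # Biased (V-statistic) squared distance correlation for 1-D inputs.
    Dx = np.abs(x[:, None] - x[None, :])
    Dy = np.abs(y[:, None] - y[None, :])
    A, B = _center(Dx), _center(Dy)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return dcov2 / denom if denom > 0 else 0.0

def perm_test(x, y, reps=200, seed=0):
    # p-value via random permutations of y against a fixed x.
    rng = np.random.default_rng(seed)
    observed = dcorr_stat(x, y)
    null = [dcorr_stat(x, rng.permutation(y)) for _ in range(reps)]
    # Add-one correction keeps the p-value strictly positive.
    p = (1 + sum(s >= observed for s in null)) / (1 + reps)
    return observed, p

rng = np.random.default_rng(1)
x = rng.normal(size=100)
stat, p = perm_test(x, 2 * x + rng.normal(scale=0.1, size=100))
```

With a strong linear dependence as above, the observed statistic exceeds every permuted statistic, so the p-value is near its minimum of 1/(reps + 1).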
hyppo also provides benchmarks for each of these tests by comparing statistical power over many statistical models. The contribution of this manuscript is therefore to provide: (1) an overview of the library and examples of how to use some of the tests in the package, and (2) comparisons of the test statistics and wall times with similar R packages.

* Corresponding author: jovo@jhu.edu. ¹ Johns Hopkins University, Baltimore, MD. ² Johns Hopkins School of Public Health, Baltimore, MD. ³ University of Delaware, Newark, DE. ⁴ Progressive Learning.

2 Library Overview

Inspired by the desire to allow convenient use of these independence tests, hyppo has been developed as a hypothesis testing package. The package structure is modeled on the APIs of scikit-learn and the energy R package. Links to source code, documentation, and tutorials can be found here: https://hyppo.neurodata.io.

Included Tests

We have included a host of notable and novel hypothesis tests that we determined to be useful for the end user. Shen and Vogelstein [10] showed that distance and kernel methods are equivalent, and thus we have one implementation that is able to perform both with a proper bijective transformation. We have implemented k-sample tests as specified in Panda et al. [28], and every algorithm in the following list can be used as a two-sample or k-sample test this way. The included algorithms are:

• Multivariate generalizations of Pearson's product-moment correlation: RV [29, 30] and canonical correlation analysis (Cca) [31].
• Heller–Heller–Gorfine (Hhg) [11]: A multivariate distance-based test.
• Distance correlation (Dcorr): Both biased [4] and unbiased [32] statistics, and a fast O(n log n) variant that runs on Euclidean one-dimensional data [33].
• Hilbert–Schmidt independence criterion (Hsic): Both biased and unbiased [34] kernel-based statistics. A fast chi-square statistic is also implemented [35].
• Multiscale graph correlation (Mgc) [14]: An independence test that combines k-nearest neighbors and energy statistics. Recently, Mgc has been accepted into scipy.stats, and this implementation wraps the scipy implementation.
• Friedman–Rafsky [36]: A tree-based two-sample test.
• dHsic [37]: A d-variate independence test based on Hsic.
• Multivariate analysis of variance (Manova) [24, 38] and Hotelling's T² (Hotelling) [39].
• Maximum mean discrepancy (Mmd) [21]: A kernel two-sample test.
• Energy [20]: A distance-based two-sample test.
• Distance components (Disco) [27]: A distance-based k-sample test.
• Smooth CF Test [40]: A test using analytic analogues of characteristic functions.
• Mean Embedding Test [40]: A test based on the analytical mean embeddings between two distributions.
• KSample Hhg [41]: A k-sample test for Hhg.
• Fast Conditional Independence Test [42]: A fast, nonparametric conditional independence test.
• Kernel Conditional Independence Test [43]: An efficient kernel conditional independence test.
• Finite Set Stein Discrepancy [44]: A linear-time kernel goodness-of-fit test.
• Discriminability test [45]: A highly accurate and powerful discriminability test.
• Partial Dcorr [46]: A method to perform conditional independence testing using distance correlation.
• Conditional Dcorr [47]: Conditional independence testing using Dcorr with strong theoretical properties.
• LjungBox [48]: Tests whether groups of autocorrelations of a time series are different from zero.

A number of algorithms have been implemented that lack an open-source implementation elsewhere. These include:

• Kernel mean embedding random forest (Kmerf) [49]: A kernel test that leverages the random-forest-induced kernel similarity matrix to generate a test statistic.
• Fast implementations of Dcorr (Fast Dcorr) [50]: An approximation to Dcorr when calculating the p-value.
• Universally consistent k-sample tests via independence testing [28]: Transforms the k-sample testing problem into the independence testing problem and then uses nonparametric independence tests from hyppo.
• Time-series Mgc and Dcorr: Applying Mgc and Dcorr to time-series data.
• Maximal Margin Correlation (MaxMargin) [50]: A highly accurate formulation of independence tests in high dimensions with minimal added computational complexity.

Figure 1: Benchmarks of hyppo implementations against corresponding R implementations for tests in the independence testing module. Average wall times (over 3 repetitions) (left) are shown for Dcorr in energy, Mmd in kernlab, and Hhg in HHG, compared against hyppo implementations of Mgc, Hhg, Dcorr, Mmd, and Fast Dcorr. Test statistic comparisons (right) between Dcorr, Mmd, and Hhg in hyppo are compared against their respective reference R implementations. Test statistics are nearly identical for each implementation.

Structure of hyppo

The modules of hyppo are as follows: independence, d_variate, ksample, discrim, kgof, time_series, and tools. Each test within hyppo contains a .test method, which the user runs, that returns at least a statistic and p-value in all cases; parallelization of the permutation tests is performed using joblib. Test statistic code is compiled and cached with numba, a just-in-time (JIT) compiler. tools contains common helper functions, simulations, and finite-sample statistical power functions to test differences between each test in hyppo.
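The shared .test interface described above can be illustrated with a minimal, self-contained sketch. This is an illustrative re-implementation of the pattern only, not hyppo's actual code; the class names and the Pearson stand-in statistic are assumptions for the example:

```python
import numpy as np

class IndependenceTest:
    """Minimal sketch of a test object exposing a .test method
    that returns at least a statistic and a permutation p-value."""

    def statistic(self, x, y):
        raise NotImplementedError

    def test(self, x, y, reps=1000, random_state=None):
        # Estimate the null distribution by permuting y, then
        # return (statistic, p-value) with an add-one correction.
        rng = np.random.default_rng(random_state)
        stat = self.statistic(x, y)
        count = sum(
            self.statistic(x, rng.permutation(y)) >= stat for _ in range(reps)
        )
        pvalue = (1 + count) / (1 + reps)
        return stat, pvalue

class PearsonTest(IndependenceTest):
    # Stand-in statistic: absolute Pearson correlation for 1-D inputs.
    def statistic(self, x, y):
        return abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
x = rng.normal(size=200)
stat, pvalue = PearsonTest().test(x, x + rng.normal(size=200), reps=200)
```

Subclassing keeps the permutation machinery in one place while each concrete test supplies only its statistic, which is the design the package description above suggests.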
3 Benchmarks

Wall Time Comparisons

Figure 1 (left) shows the computational efficiency of hyppo's implementations against existing implementations in commonly used R packages, specifically energy [51], kernlab [52], and HHG [53]. When comparing performance, wall times are averages over 3 trials of p-value computations (1000 replications when permutation tests are used), calculated on a univariate noisy linear simulation with the number of samples increasing from 50 to 10,000. All computations were performed on an Ubuntu 18.04.3 LTS system with access to 96 cores. When sample sizes are above a few hundred, all algorithms achieve approximately quadratic times, with different slopes. Hhg was the slowest, as expected, though it had comparable speeds to the other algorithms at low sample sizes. Mgc and Dcorr are next, and still require only tens of minutes to run when sample sizes are around 10,000. The energy package's Dcorr is faster at low sample sizes than kernlab's implementation of Mmd (Dcorr is equivalent to Mmd for all finite sample sizes [10]), and remains so even at a sample size of 10,000. hyppo's Fast Dcorr, which uses a fast statistic [33] and a p-value approximation [35], is the fastest, even though energy and kernlab both use highly optimized C++ versions.

Implementation Validation

Next, we verify that hyppo's test statistics are equivalent to those of existing R implementations. Specifically, hyppo's implementations were compared to: Dcorr from the energy package [51], Mmd from the kernlab package [52], and Hhg from the HHG package [53]. The evaluation uses a spiral simulation with 1000 samples and 2 dimensions for each test and compares test statistics over 20 repetitions. Figure 1 (right) shows the difference between each hyppo implementation and the respective R package implementation of the independence test. Test statistics are nearly equivalent for each implementation.
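Wall-time measurements of this kind can be reproduced with a small harness. The sketch below is a hypothetical harness, not the benchmark code used in the paper; the Pearson stand-in statistic and the sample sizes are placeholders for the statistics and 50-to-10,000 grid described above:

```python
import time
import numpy as np

def time_statistic(stat_fn, sample_sizes, trials=3, seed=0):
    # Average wall time of stat_fn over `trials` runs at each sample size,
    # on a univariate noisy linear simulation.
    rng = np.random.default_rng(seed)
    results = {}
    for n in sample_sizes:
        x = rng.normal(size=n)
        y = x + rng.normal(size=n)  # noisy linear relationship
        elapsed = []
        for _ in range(trials):
            start = time.perf_counter()
            stat_fn(x, y)
            elapsed.append(time.perf_counter() - start)
        results[n] = sum(elapsed) / trials
    return results

# Example with a cheap stand-in statistic (Pearson correlation):
timings = time_statistic(lambda x, y: np.corrcoef(x, y)[0, 1], [50, 500, 5000])
```

Plotting the averaged timings against sample size on log-log axes yields curves comparable to Figure 1 (left), where slope reflects the computational order of each statistic.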
4 Conclusion

hyppo is an extensive and extensible open-source Python package for multivariate hypothesis testing. As hyppo continues to grow and add functionality, it will enhance the tools scientists use when determining relationships within their investigations.

Acknowledgements

This work is graciously supported by the Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines program through contract FA8650-18-2-7834, National Institutes of Health awards RO1MH120482 and T32GM119998, and National Science Foundation award DMS-1921310. The authors would like to acknowledge the NeuroData Design class and the NeuroData lab at Johns Hopkins University for helpful feedback.

References

[1] Karl Pearson. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347-352):240–242, 1895.
[2] Gábor J Székely and Maria L Rizzo. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.
[3] Gábor J Székely and Maria L Rizzo. The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, 117:193–213, 2013.
[4] Gábor J Székely, Maria L Rizzo, and Nail K Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.
[5] Russell Lyons. Distance covariance in metric spaces. The Annals of Probability, 41(5):3284–3305, 2013.
[6] Arthur Gretton and László Györfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11(Apr):1391–1423, 2010.
[7] Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, and Bernhard Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6(Dec):2075–2129, 2005.
[8] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
[9] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.
[10] Cencheng Shen and Joshua T. Vogelstein. The exact equivalence of distance and kernel methods in hypothesis testing. AStA Advances in Statistical Analysis, 105(3):385–403, 2021.
[11] Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2012.
[12] Cencheng Shen, Carey E Priebe, and Joshua T Vogelstein. From distance correlation to multiscale graph correlation. Journal of the American Statistical Association, 115(529):280–291, 2020.
[13] Youjin Lee, Cencheng Shen, Carey E Priebe, and Joshua T Vogelstein. Network dependence testing via diffusion maps and distance-based correlations. Biometrika, 106(4):857–873, 2019.
[14] Joshua T Vogelstein, Eric W Bridgeford, Qing Wang, Carey E Priebe, Mauro Maggioni, and Cencheng Shen. Discovering and deciphering relationships across disparate data modalities. eLife, 8:e41690, 2019.
[15] Dave S Collingridge. A primer on quantitized data analysis and permutation testing. Journal of Mixed Methods Research, 7(1):81–97, 2013.
[16] Meyer Dwass. Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics, pages 181–187, 1957.
[17] Phillip I Good. Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer Science & Business Media, 2006.
[18] Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, and Joshua T Vogelstein. Independence testing for temporal data. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=jv1aPQINc4.
[19] Student. The probable error of a mean. Biometrika, pages 1–25, 1908.
[20] Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.
[21] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[22] Ronald A Fisher. XV.—The correlation between relatives on the supposition of Mendelian inheritance. Earth and Environmental Science Transactions of the Royal Society of Edinburgh, 52(2):399–433, 1919.
[23] Maurice S Bartlett. Multivariate analysis. Supplement to the Journal of the Royal Statistical Society, 9(2):176–197, 1947.
[24] Russell Warne. A primer on multivariate analysis of variance (MANOVA) for behavioral scientists. Practical Assessment, Research, and Evaluation, 19(1):17, 2014.
[25] JP Stevens. Applied Multivariate Statistics for the Social Sciences. Lawrence Erlbaum, Mahwah, NJ, 2002.
[26] Ruth Heller, Yair Heller, Shachar Kaufman, Barak Brill, and Malka Gorfine. Consistent distribution-free k-sample and independence tests for univariate random variables. The Journal of Machine Learning Research, 17(1):978–1031, 2016.
[27] Maria L Rizzo, Gábor J Székely, et al. DISCO analysis: A nonparametric extension of analysis of variance. The Annals of Applied Statistics, 4(2):1034–1055, 2010.
[28] Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz, Carey E. Priebe, and Joshua T. Vogelstein. High-dimensional and universally consistent k-sample tests. arXiv preprint arXiv:1910.08883, 2024.
[29] Yves Escoufier. Le traitement des variables vectorielles. Biometrics, pages 751–760, 1973.
[30] Paul Robert and Yves Escoufier. A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society: Series C (Applied Statistics), 25(3):257–265, 1976.
[31] David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
[32] Gábor J Székely, Maria L Rizzo, et al. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, 2014.
[33] Arin Chaudhuri and Wenhao Hu. A fast algorithm for computing distance correlation. Computational Statistics & Data Analysis, 135:15–24, 2019. ISSN 0167-9473. doi: 10.1016/j.csda.2019.01.016. URL http://www.sciencedirect.com/science/article/pii/S0167947319300313.
[34] Arthur Gretton, Kenji Fukumizu, Choon H Teo, Le Song, Bernhard Schölkopf, and Alex J Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592, 2008.
[35] Cencheng Shen, Sambit Panda, and Joshua T. Vogelstein. The chi-square test of distance correlation. Journal of Computational and Graphical Statistics, 31(1):254–262, 2022.
[36] Jerome H Friedman and Lawrence C Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pages 697–717, 1979.
[37] Niklas Pfister, Peter Bühlmann, Bernhard Schölkopf, and Jonas Peters. Kernel-based tests for joint independence. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(1):5–31, 2018.
[38] Gregory Carey. Multivariate analysis of variance (MANOVA): I. Theory. Retrieved May 14, 2011, 1998.
[39] Harold Hotelling. The generalization of Student's ratio. In Breakthroughs in Statistics, pages 54–65. Springer, 1992.
[40] Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems, 28, 2015.
[41] Ruth Heller and Yair Heller. Multivariate tests of association based on univariate tests. Advances in Neural Information Processing Systems, 29, 2016.
[42] Krzysztof Chalupka, Pietro Perona, and Frederick Eberhardt. Fast conditional independence test for vector variables with large sample sizes. arXiv preprint, 2018.
[43] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint, 2012.
[44] Wittawat Jitkrittum, Wenkai Xu, Zoltán Szabó, Kenji Fukumizu, and Arthur Gretton. A linear-time kernel goodness-of-fit test. Advances in Neural Information Processing Systems, 30, 2017.
[45] Zeyi Wang, Eric Bridgeford, Shangsi Wang, Joshua T. Vogelstein, and Brian Caffo. Statistical analysis of data repeatability measures, 2020.
[46] Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, December 2014. ISSN 0090-5364, 2168-8966. doi: 10.1214/14-AOS1255.
[47] Xueqin Wang, Wenliang Pan, Wenhao Hu, Yuan Tian, and Heping Zhang. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.
[48] Greta M Ljung and George EP Box. On a measure of lack of fit in time series models. Biometrika, 65(2):297–303, 1978.
[49] Cencheng Shen, Sambit Panda, and Joshua T. Vogelstein. Learning interpretable characteristic kernels via decision forests. arXiv preprint, 2024.
[50] Cencheng Shen and Yuexiao Dong. High-dimensional independence testing via maximum and average distance correlations. arXiv preprint, 2024.
[51] Maria Rizzo and Gabor Szekely. energy: E-Statistics: Multivariate Inference via the Energy of Data, 2018. URL https://CRAN.R-project.org/package=energy. R package version 1.7-5.
[52] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004. URL http://www.jstatsoft.org/v11/i09/.
[53] Barak Brill and Shachar Kaufman. HHG: Heller-Heller-Gorfine Tests of Independence and Equality of Distributions, 2019. URL https://CRAN.R-project.org/package=HHG. R package version 2.3.2.
