Density Estimation with Distribution Element Trees


Authors: Daniel W. Meyer

Statistics and Computing manuscript. The final publication is available at Springer via http://dx.doi.org/10.1007/s11222-017-9751-9

Received: 06/11/2016 / Accepted: 02/05/2017

Abstract: The estimation of probability densities based on available data is a central task in many statistical applications. Especially in the case of large ensembles with many samples or high-dimensional sample spaces, computationally efficient methods are needed. We propose a new method that is based on a decomposition of the unknown distribution in terms of so-called distribution elements (DEs). These elements enable an adaptive and hierarchical discretization of the sample space, with small elements in regions with highly variable densities and large elements where the density is smooth. The novel refinement strategy that we propose is based on statistical goodness-of-fit and pairwise (as an approximation to mutual) independence tests that evaluate the local approximation of the distribution in terms of DEs. The capabilities of our new method are inspected based on several examples of different dimensionality and successfully compared with other state-of-the-art density estimators.

Keywords: nonparametric density estimation · adaptive histogram · kernel density estimation · adaptive binning · polynomial histogram · curse of dimensionality · high dimensional · big data · Pólya tree · density estimation tree

Mathematics Subject Classification (2000): 62G07 · 62H10 · 62G10

Acknowledgements: The author is grateful to Marco Weibel for his help during the preparation of this manuscript. Very valuable feedback from an associate editor and two reviewers and helpful input from Oliver Brenner and Florian Müller are gratefully acknowledged. Moreover, the author acknowledges helpful comments from Nina Roth and feedback on the initial version of this manuscript from Patrick Jenny, both from ETH Zürich. The author has been financially supported by ETH Zürich.

Daniel W. Meyer
Institute of Fluid Dynamics, ETH Zürich
Tel.: +41-44-633-9273, Fax: +41-44-632-1147
E-mail: meyerda@ethz.ch

1 Introduction

In this work, we propose a new method for estimating a probability density p(x) of the random variable vector X at position x = (x_1, ..., x_d)^T of the bounded, d-dimensional probability space Ω, based on a given ensemble of n samples x_1, ..., x_n. Unlike other methods, our approach is applicable for large and/or high-dimensional datasets. Density estimation methods are essentially categorized into parametric and non-parametric methods (e.g., Haerdle et al 2004; Scott 2015). While we focus in this work on the development of a non-parametric approach, our method can be viewed as a hybrid between the two categories (Yang 2008). Therefore, before we focus on non-parametric methods, we start the following literature review by briefly explaining the concept of parametric density estimation.

In parametric density estimation, a parametric density model p(x | θ) is given in analytical form. With the parameters θ = (θ_1, θ_2, ...) estimated from the available data, we arrive at a complete density characterization. This characterization can be evaluated with statistical tests, or more precisely composite goodness-of-fit tests, such as for example the χ² test of Pearson (1900). Here, for a given significance level α, we either do not reject the null hypothesis p(x) = p(x | θ) and reject the alternative p(x) ≠ p(x | θ), or vice versa. For small ensembles, these tests lose their power, meaning that they become unable to detect alternatives and instead do not reject the null hypothesis (e.g., Steele and Chaseling 2006).
Parametric density estimation is of limited generality, since a predefined density model is needed. Non-parametric density estimation is more general, as no parametric density model is required. A widely used representative from this category of density estimators is the histogram (Fix and Hodges 1951; Scott 2015, section 3). Here, the probability space is typically discretized into equally-spaced bins and the probability density within one bin is set proportional to the number of samples in that bin. To determine the bin width h, the mean integrated square error (MISE), or more precisely its asymptotic approximation referred to as AMISE, is minimized for cases with known p(x); leading, for example, with a Gaussian p(x) to an h given by the so-called normal reference rule (Haerdle et al 2004, equation (2.26)). Histograms are conceptually simple, but have certain drawbacks. First of all, they are relatively inaccurate and, with the overall number of bins growing exponentially with the number of dimensions d, conventional histograms become prohibitive in cases with large d (Scott 2015, section 3.4.1). More economical and possibly more accurate variants with adaptive bin widths are available (e.g., Kogure 1987). These variants attempt to better resolve regions with large density variation with finer bins, while using larger bins in more uniform areas. Adaptive histograms rely on estimates of gradients of p(x) or the use of percentile meshes (Scott 2015, section 3.2.8).

One possible generalization of histograms are cubic log-splines (Kooperberg and Stone 1991). Here, the one-dimensional probability space is partitioned into bins like in a histogram, but within a bin, a parametric model in the form of a cubic polynomial or spline is applied.
The bin bounds or knots are placed subject to a rule that was derived based on experience in fitting log-spline models (Kooperberg and Stone 1991, section 5.1). The coupled polynomial coefficients or spline parameters, in turn, are determined numerically with a Newton–Raphson method by maximizing the likelihood of the log-spline estimator. The overall number of knots is either calculated by a predefined rule or by sequential knot-deletion. A related approach that is applied in multi-variate settings are so-called polynomial histograms. Here, the density in a bin varies according to a linear, quadratic, or higher-order polynomial (Scott and Sagae 1997; Jing et al 2012). Unlike in the log-spline method, the polynomial coefficients are determined locally based on statistical moments estimated within individual bins (conditional moments). The use of higher-order polynomials enables the use of larger bins globally while maintaining the same MISE compared to conventional histograms (Jing et al 2012, table 1). This is because the density variation is to some extent accounted for already at the level of an individual bin. Especially in cases where d is large, this reduces computational costs. So far, equidistant and prescribed non-equidistant bin grids were considered (Jing et al 2012; Zaunders et al 2016).

Besides histograms, a second important category of non-parametric methods is kernel density estimation (KDE) (Rosenblatt 1956; Sheather 2004). Here, samples are not grouped into bins, but are equipped with kernel functions, e.g., triangular, Gaussian, etc. (Silverman 1998, table 3.1). Similarly to the bin width in histograms, kernels have a certain support or bandwidth h. Optimal global bandwidths were determined with AMISE analysis based on second-order derivatives of p(x) (e.g., Scott 2015, equations (6.18) and (6.50)).
Adaptive, more accurate methods that reduce the bandwidth in dense areas and use wider kernels in sparse regions have been documented (e.g., Loftsgaarden and Quesenberry 1965; Achilleos and Delaigle 2012), but are not completely satisfactory yet. For example, Scott (2015, section 6.8) summarized quite recently: 'Adaptive methods hold much promise, but usually introduce many new parameters that are difficult to estimate, and frequently introduce artifacts of the sample (rather than the underlying density).' Certain efficient spectral KDE implementations rely on bandwidths that are constant within different directions x_i in probability space (e.g., O'Brien et al 2016; Silverman 1998, section 3.5). The highly-cited KDE method of Botev et al (2010), on the other hand, is adaptive, is available in the form of an efficient spectral implementation (Botev 2007), and reduces boundary bias effects of existing KDE approaches. Recently, a new KDE method for bounded domains that eliminates boundary bias issues was presented in the context of functional data analysis by Petersen and Muller (2016). A high-dimensional implementation of the method of Botev et al (2010) was recently made available, where the cost per density query scales as an exponential fraction of the number of samples (Botev 2016). Alternative KDE methods were summarized in (Park and Marron 1990; Park and Turlach 1992; Cao et al 1994; Jones et al 1996; Silverman 1998; Haerdle et al 2004; Scott 2015).

A density estimator for exponential families (a broad class of densities) that has superior convergence properties compared to KDE was presented by Sriperumbudur et al (2013). This estimator is based on minimizing the Fisher divergence and requires the solution of an (nd + 1) × (nd + 1) linear system. Most interestingly, the advantage over KDE grows with increasing dimensionality d, as was shown numerically.
Density estimation based on mixture distributions can be viewed as a generalization of KDE, where the unknown distribution is expressed, like in KDE, as a superposition of probability densities (e.g., Wang and Wang 2015). These densities are referred to as mixture models, and each mixture model is associated with an ensemble subset containing (unlike in KDE) multiple samples. In this context, the so-called Dirichlet process mixture models, which go back to (Ferguson 1973), have received renewed attention after progress was made in the numerical estimation of mixture parameters (e.g., Neal 2000).

To arrive at an efficient adaptive method, tree-based approaches have been proposed more recently (e.g., Ram and Gray 2011; Wong and Ma 2010; Jiang et al 2016). These approaches start from the probability space C = Ω, which is typically assumed to be a d-dimensional hypercube. The root cube Ω, or cuboid to be more precise, is recursively subdivided into smaller cuboids, e.g., C_1 and C_2 with C = C_1 ∪ C_2, based on suitable conditions. More specifically, Ram and Gray (2011) discussed density estimation trees, or shorter density trees, that are derived from decision trees (Breiman et al 1984) and are constructed based on optimal split operations of cuboids at tree nodes. These splits are optimal in the sense that they maximally reduce the integral square error (ISE), i.e., ISE(C) > ISE(C_1) + ISE(C_2). The optimum is found from all possible splits in each dimension. The large number of possibilities renders the method expensive for large datasets (Ram and Gray 2011, equation (9)). A preset lowest threshold is set to stop the splitting. The splitting process is followed by a tree pruning and cross-validation step. The resulting tree, or more precisely its leaves, provides a histogram with adaptive bin widths.
The tree structure enables fast density estimation at a cost proportional to the tree depth. Wong and Ma (2010) and Jiang et al (2016), on the other hand, have introduced and numerically implemented, respectively, the optional Pólya tree (OPT) method. In this approach, cuboids C are partitioned and uniform cuboid probability densities p(x | C_i) (or q(x | C_i) in their work) are assigned according to probabilistic processes involving Bernoulli and Dirichlet random numbers, respectively. The random partitioning process relies on selection and stopping probabilities, λ_i(C | x) and ρ(C | x), respectively, that are calculated based on a recursive expression (equation (2.1) in Jiang et al 2016) that necessitates partitioning down to subregions with either zero or one sample. This leads to a close-to-exponential growth in computing time as a function of the number of dimensions (Jiang et al 2016, figure 1). As a remedy, in naive inexact OPT (NI-OPT), partitioning limits were introduced, like a smallest number of points in subregions or a smallest size of subregions, to arrive at manageable trees. Moreover, limited-lookahead OPT (LL-OPT) introduces two additional tuning parameters that control the tree depth for the recursive calculation and thus help to reduce memory requirements and computing times of OPT and NI-OPT. The density estimates p(x) that result from OPT implementations are piecewise uniform within subregions C or bins and are adaptive within individual dimensions. To relate back to our initial discussion on parametric methods, Ma and Wong (2011) outline an OPT method for goodness-of-fit testing of large datasets against a given base distribution p_0 (or q_0 in their work). Here, instead of uniform cuboid probability densities, p_0 is used, but otherwise the OPT methodology of Wong and Ma (2010) applies.
As a measure of the overall goodness-of-fit, the integral stopping probability ρ(Ω) was proposed (Ma and Wong 2011, section 3).

In the present contribution, we develop a new non-parametric density estimator that is adaptive and cost efficient. In view of the so-called curse of dimensionality, that is, diminishing MISE convergence rates for increasing dimensionality (e.g., Scott 2015, equation (3.67)), methods that enable the treatment of large datasets at small computational costs become important. While our method adopts certain features from the previously introduced polynomial histograms and tree-based techniques, it is conceptually and algorithmically simpler and computationally more efficient compared to these approaches.

Our development starts by recognizing that a histogram is essentially a collection of disjoint piecewise uniform distributions. In a histogram with equally-sized bins, depending on the true density distribution, a uniform approximation may be accurate in certain bins, while inaccurate in others. Like in the context of parametric methods, we could apply a statistical test to evaluate the goodness-of-fit of the data in individual bins against a uniform null hypothesis. Thus, a natural recursive way of constructing a histogram emerges: Ω is defined as a root hypercuboid that encloses the available data. With a suitable goodness-of-fit test, it is tested whether the uniform null hypothesis is adopted at a given significance level based on the data in the cuboid. If not, the root cuboid is split, and the testing and splitting is recursively repeated for all cuboids resulting from this and subsequent splits. Eventually, the test will adopt the uniform hypothesis in all resulting subcuboids. This is because the power of the test diminishes as the number of samples in each subcuboid becomes smaller during consecutive splitting.
Finally, an adaptive histogram results, where in each bin the uniform distribution is supported by a positive outcome of a goodness-of-fit test. This approach is not entirely non-parametric, as a significance level has to be prescribed. The nested spatial arrangement of cuboids and subcuboids, which is organized in a tree structure, enables the fast local query of density estimates.

To further reduce the number of bins, a next step is to apply, instead of uniform or constant bin-densities, linear or higher-order densities similar to polynomial histograms. The corresponding polynomial coefficients are derived from the bin data and a composite goodness-of-fit test is applied. In this sense, the proposed method is a hybrid between parametric and non-parametric density estimation. With the density inside a bin being a polynomial of a certain order that approximates the density distribution locally, we refer to it as a distribution element (DE). Since the resulting density estimate is given in the form of a tree with DEs at its leaves, we refer to our approach as the DE tree (DET) density estimator. In the following section 2, we provide details about the formulation of the DET estimator and in section 3 present comparative applications. Concluding remarks are provided in section 4.

2 Formulation

We aim at accurately estimating the probability density p(x) at different positions x = (x_1, ..., x_d)^T ∈ Ω based on a given ensemble including n samples x_1, ..., x_n. The probability space Ω is defined by a hypercuboid, i.e.,

    Ω = \prod_{i=1}^{d} [x_{i,l}, x_{i,u}],

with lower and upper bounds x_{i,l} and x_{i,u}, respectively, of components x_i such that x_j ∈ Ω ∀ j = 1, 2, ..., n. This condition is for example met if x_{i,l} = \min_{j=1,...,n} x_{i,j} and x_{i,u} = \max_{j=1,...,n} x_{i,j}, where x_{i,j} refers to the i-th component of sample x_j.¹
In our method, Ω is split recursively into m smaller hypercuboids, or simply cuboids,

    C_k = \prod_{i=1}^{d} [x^k_{i,l}, x^k_{i,u}] ⊂ Ω

with k = 1, 2, ..., m. The cuboids C_k are disjoint and satisfy Ω = \bigcup_{k=1}^{m} C_k. A cuboid C_k together with a local density p_k(x) comprises the k-th DE. Based on all DEs, the DET density estimator is then given by

    p(x) = \sum_{k=1}^{m} p_k(x)  ∀ x ∈ Ω.   (1)

In the following section 2.1, we define p_k(x) and in section 2.2, we introduce the recursive splitting method that leads to the DET density estimator.

¹ Fixing the domain bounds based on the data range leads to bounds that are almost certainly too narrow. Accordingly, the resulting density estimates will display a bias toward too high values.

2.1 Distribution Elements

Similar to a bin in a histogram, a DE can be viewed as the least complex building block or atom of a density distribution estimate or DET. Therefore, we chose a simple analytical form that is suitable to approximate the density in a small subregion C_k of Ω. We define the probability density of DE k as

    p_k(x) = \begin{cases} \frac{n(C_k)}{n} \prod_{i=1}^{d} p[x_i \mid θ_i(C_k)] & ∀ x ∈ C_k, \\ 0 & \text{otherwise,} \end{cases}   (2)

where p[x_i | θ_i(C_k)] are marginal densities of components x_i with local parameter vectors θ_i, and n(C_k)/n is the fraction of all n samples that reside in C_k. Therefore, in each DE, the random variables X_i are approximated as statistically independent. This has important implications, since it enables us to break the exponential growth of bins in terms of d, as will become clear in the next section.

Insertion of expression (2) in equation (1) and integration over the entire probability space Ω reveals that, since \sum_{k=1}^{m} n(C_k) = n, the DET estimator integrates to one and therefore satisfies the normalization condition of a probability density function (PDF).
Moreover, if p[x_i | θ_i(C_k)] ≥ 0 ∀ x_i ∈ [x^k_{i,l}, x^k_{i,u}], the DET estimator is non-negative for x ∈ Ω and therefore is a PDF.

With the uniform marginal density

    p[x_i | θ_i(C_k)] = \frac{1}{x^k_{i,u} - x^k_{i,l}},   (3)

which defines a so-called constant DE, we obtain from equation (2)

    p_k(x) = \frac{n(C_k)}{n} \frac{1}{\prod_{i=1}^{d} (x^k_{i,u} - x^k_{i,l})},

which is the familiar density in a histogram bin (e.g., Scott 2015, section 3.4). Similarly, the density of a linear DE is given as

    p[x_i | θ_i(C_k)] = \left( \frac{x_i - x^k_{i,l}}{x^k_{i,u} - x^k_{i,l}} - \frac{1}{2} \right) θ_{i,1}(C_k) + \frac{1}{x^k_{i,u} - x^k_{i,l}},   (4)

where θ_{i,1}(C_k) is a slope parameter. We notice that a linear DE with θ_{i,1} = 0 is equivalent to a constant DE. To estimate the slope based on the data points inside C_k, we apply the following minimum mean square error (MMSE) estimator

    θ_{i,1}(C_k) = \frac{n(C_k)\, s_i(C_k)^3}{n(C_k)\, s_i(C_k)^2 + 144\, \langle X'^2_i | C_k \rangle}   (5)

with s_i(C_k) = 6(2⟨X_i | C_k⟩ − 1). Here, the quantities ⟨X_i | C_k⟩ and ⟨X'²_i | C_k⟩ are the mean and variance estimates based on component x_i of the samples contained in cuboid C_k. A derivation of the MMSE slope estimator (5) is included in the appendix. To make sure that p[x_i | θ_i(C_k)] ≥ 0, we clip θ_{i,1}(C_k) such that θ_{i,1}(C_k) ∈ [−2, 2]. In this work, we focus on DEs with uniform and linear marginal PDFs, but higher-order DEs are conceivable as well.

It is pointed out that the polynomials (2) are constructed differently compared to the ones applied by Jing et al (2012, equations (5) and (6)). Our DE classification is based on the marginal densities p[x_i | θ_i(C_k)], while Jing et al classify the expanded polynomial p_k(x) and allow for statistical dependence among the components x_i.
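To make the element definitions concrete, the marginals (3) and (4), the DE density (2), and the slope estimate (5) can be sketched as follows. This is a minimal illustration with hypothetical function names (not the reference implementation); for the slope estimate, the component samples are assumed to be mapped to the unit coordinates of the cuboid edge.

```python
import numpy as np

def linear_marginal(xi, lo, hi, theta):
    """Linear marginal density, equation (4); theta = 0 recovers the
    uniform marginal, equation (3)."""
    y = (xi - lo) / (hi - lo)                  # position mapped to [0, 1]
    return ((y - 0.5) * theta + 1.0) / (hi - lo)

def de_density(x, bounds, thetas, n_Ck, n):
    """DE density, equation (2): the sample fraction n(C_k)/n times the
    product of the (independent) marginals; zero outside the cuboid C_k."""
    if any(not (lo <= xi <= hi) for xi, (lo, hi) in zip(x, bounds)):
        return 0.0
    p = n_Ck / n
    for xi, (lo, hi), th in zip(x, bounds, thetas):
        p *= linear_marginal(xi, lo, hi, th)
    return p

def mmse_slope(xs, lo, hi):
    """MMSE slope estimate, equation (5), clipped to [-2, 2] so that the
    marginal (4) stays non-negative."""
    y = (np.asarray(xs) - lo) / (hi - lo)      # unit coordinates in C_k
    n = y.size
    s = 6.0 * (2.0 * y.mean() - 1.0)           # moment estimate s_i(C_k)
    var = y.var()                              # variance <X'^2_i | C_k>
    return float(np.clip(n * s**3 / (n * s**2 + 144.0 * var), -2.0, 2.0))

# A 2-D DE on [0,1] x [0,2] holding 40 of 100 samples, constant in x1
# (theta = 0) and linearly increasing in x2 (theta = 1):
print(de_density((0.5, 1.5), [(0.0, 1.0), (0.0, 2.0)], [0.0, 1.0], 40, 100))  # 0.25
rng = np.random.default_rng(0)
print(mmse_slope(rng.uniform(0.0, 1.0, 10000), 0.0, 1.0))  # close to 0
```

Note how the shrinkage in (5) pulls the naive moment estimate s_i toward zero when the sample is small relative to its variance, which stabilizes the slope in sparsely populated elements.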
2.2 Distribution Element Tree

For the construction of the DET density estimator (1), a constant or linear DE is assigned to the probability space Ω or root cuboid, depending on the required order. To verify for the root DE or any subsequent DE k whether the data included in Ω or C_k are compatible with the DE density (2), we apply goodness-of-fit tests for the marginal distributions and test independence of the joint distribution. This testing approach is directly implied by the structure of the DE density (2). (a) If one or several of these tests fail, meaning that the null hypothesis given by the DE density (2) is rejected, the distribution of the data within C_k is most likely more complex (and there is sufficient data within DE k for rejecting). Consequently, DE k is considered interim and a split along one or two probability space directions is conducted. Next, the testing and splitting process continues for each of the resulting two or four DEs. On the other hand, (b) if DE k passes all tests or contains no data, the splitting process stops, DE k is final, and it becomes one of the m DEs that are part of the DET density estimator (1). During the outlined estimator construction process, a tree emerges with its root given by the DE on Ω, branches to interim DEs resulting from DE splits, and final DEs at the leaves of the tree. More details about the testing and splitting processes follow in the next few paragraphs.

2.2.1 Splitting

If a split of DE k on cuboid C_k along dimension x_i is conducted, cuboid C_k is split either into subcuboids with equal volume (equal-size split) or subcuboids with approximately equal numbers of samples (equal-score split). If splits in directions x_i and x_j are to be conducted, we first split along direction x_i and then split each of the two resulting DEs along x_j.
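The recursive test-and-split construction of section 2.2 can be sketched generically; the sketch below uses equal-size splits and a pluggable `passes_tests` callable as a stand-in for the statistical tests of sections 2.2.2 and 2.2.3 (all names are ours, and the depth guard is a safety device of this sketch, not part of the method).

```python
def build_det(samples, bounds, passes_tests, depth=0, max_depth=20):
    """Recursive DET construction (section 2.2), simplified.

    samples:      list of d-dimensional points inside `bounds`
    bounds:       list of (lo, hi) per dimension, the cuboid C_k
    passes_tests: callable (samples, bounds) -> None if the DE is final,
                  else the index of the dimension to split along.
    Returns a nested dict: interim DEs are forks, final DEs are leaves.
    """
    split_dim = None if (not samples or depth >= max_depth) \
        else passes_tests(samples, bounds)
    if split_dim is None:                      # case (b): final DE
        return {"bounds": bounds, "n": len(samples)}
    lo, hi = bounds[split_dim]                 # case (a): equal-size split
    mid = 0.5 * (lo + hi)
    left_b = [b if i != split_dim else (lo, mid) for i, b in enumerate(bounds)]
    right_b = [b if i != split_dim else (mid, hi) for i, b in enumerate(bounds)]
    left_s = [s for s in samples if s[split_dim] < mid]
    right_s = [s for s in samples if s[split_dim] >= mid]
    return {"bounds": bounds, "split_dim": split_dim, "split_at": mid,
            "children": [build_det(left_s, left_b, passes_tests, depth + 1, max_depth),
                         build_det(right_s, right_b, passes_tests, depth + 1, max_depth)]}

# Toy run: keep splitting dimension 0 while a cuboid holds > 300 samples.
samples = [(i / 1000.0,) for i in range(1000)]
tree = build_det(samples, [(0.0, 1.0)], lambda s, b: 0 if len(s) > 300 else None)
```

In the actual method, the callable's role is played by the marginal goodness-of-fit tests (split direction = rejecting direction with the smallest p-value) followed by the pairwise independence tests.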
2.2.2 Goodness-of-Fit Testing

To verify whether the data in DE k are compatible with density (2), we apply a two-stage testing sequence. First, we apply goodness-of-fit tests in each of the d directions x_i to verify whether the marginal distributions of the data are compatible with the null hypothesis given by the marginal DE densities (3) or (4), depending on the DE order. If in one or several directions the null hypothesis is rejected, we split along the rejecting direction with the smallest p-value. In this work, we focus on Pearson's χ² goodness-of-fit test (Pearson 1900) with significance level α_g. This test was originally developed for categorical data, but is often used for continuous variables as well, with the continuous data grouped into classes (e.g., Mann and Wald 1942). We define the classes such that the number of samples in

    n_c = \min\left[ \frac{n}{5},\; 4 \sqrt[5]{\frac{2(n-1)^2}{c^2}} \right]   (6)

classes, with c = \sqrt{2}\, \mathrm{erfcinv}(2 α_g), is approximately equal. Here, erfcinv is the inverse complementary error function. In expression (6), the first contribution n/5 is the rule of thumb stating that each class should contain at least five samples (Cochran 1952, section 7), and the second contribution was proposed by Mann and Wald (1942, theorem 1) and maximizes the power of the χ² test under certain conditions. With constant and linear DEs, the χ² test is conducted with n_c − 1 and n_c − 2 degrees of freedom, respectively (composite test).

To investigate the influence of the goodness-of-fit test on the performance of the DET method, we apply in section 3.5, in addition to the previously mentioned χ² goodness-of-fit test, a Kolmogorov–Smirnov (KS) test (Smirnov 1948).
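Expression (6) is straightforward to evaluate. Note that c = √2 erfcinv(2α_g) is simply the (1 − α_g) quantile of the standard normal distribution, so the Python standard library suffices (a sketch; the function name is ours):

```python
from statistics import NormalDist

def num_classes(n, alpha_g=0.001):
    """Number of chi-square classes, expression (6).

    c = sqrt(2) * erfcinv(2 * alpha_g) equals the standard normal
    (1 - alpha_g) quantile, so no special-function library is needed.
    """
    c = NormalDist().inv_cdf(1.0 - alpha_g)
    mann_wald = 4.0 * (2.0 * (n - 1) ** 2 / c ** 2) ** 0.2  # fifth root
    return int(min(n / 5.0, mann_wald))

for n in (50, 1000, 10**6):
    print(n, num_classes(n))
```

For small n, the rule-of-thumb bound n/5 dominates; for large n, the Mann–Wald term, which grows like n^{2/5}, takes over.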
Even though the KS test is non-parametric in the sense that no discrete classes have to be prescribed, we favor the χ² test for reasons of computational efficiency and convenience in cases where large samples and composite tests are involved, respectively.

2.2.3 Independence Testing

Second, if d > 1 and all tests concerning the marginal distributions are passed without any split, we verify whether the components of the data in DE k are pairwise independent and thus approximately compatible with the DE density (2). We point out that this approach is approximate, since (mutually) independent random variables, as represented by density (2), are pairwise independent, but for d > 2 pairwise independence does generally not imply mutual independence (Papoulis 1991, p. 184). If not stated otherwise, we apply Pearson's χ² independence test (Pearson 1900) with significance level α_d for pairs of components x_i and x_j with i = 1, 2, ..., d − 1 and j = i + 1, i + 2, ..., d. We use contingency tables with classes such that the samples are equally distributed among √n_c classes in the x_i- and x_j-directions (Bagnato et al 2012, section 3.1). Moreover, a χ² distribution with (√n_c − 1)² degrees of freedom is applied. To inspect the role of the independence test statistic on the performance of the DET method, we apply in section 3.5 for comparison an independence test based on Kendall's τ (Kendall 1938). However, for reasons of computational efficiency in high-dimensional cases, we favor the χ² independence test.

If for one or several component pairs the independence null hypothesis is rejected, DE splitting is done along the components of the rejecting pair with the smallest p-value. Within component pairs, the component with the smallest goodness-of-fit p-value is split first.
2.2.4 Computational Cost

The outlined approximate testing sequence avoids the initiation of simultaneous splits in more than two directions, and therefore the DET estimator is not subject to exponential growth of the number of bins in terms of d. At each DE in the tree, (a) d goodness-of-fit tests are conducted and, if d > 1 and no split was induced by these tests, (b) of the order of d²/2 pairwise independence tests follow. This testing sequence is applied at each tree node, that is, each interim and final DE. In the case of score-based splitting, an upper bound exists for the number of interim and final DEs, which determines the DET construction cost. If each final DE contains just one sample, the maximal possible number of final DEs is m = n. Moreover, if n can be expressed as an integer-valued power of two, the number of all interim and final DEs is equal to

    1 + 2 + 4 + 8 + ... + n/2 + n = 2n − 1.   (7)

Here, the sum starts with 1, representing the root DE, and after multiple generations of binary splits ends with n final DEs. If n cannot be written as an integer power of two, the next larger integer that is a power of two is n' = 2^{⌈log₂(n)⌉}, which satisfies n < n' < 2n. Then, based on result (7), an upper bound for the number of all interim and final DEs is given by 2n' − 1 < 4n − 1. To summarize, when applying equal-score splits, the computational effort for the DET construction scales in the worst case linearly with the number of samples n.

To evaluate the density estimate (1) at a certain location x, i.e., p(x), the sum (1) involving m final DEs could be evaluated. A more efficient approach, however, exploits the tree structure of the DET estimator: the only non-zero term in sum (1) can be quickly identified by starting at the root DE and by sequentially identifying at the DET forks the interim DE l where x ∈ C_l.
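The tree descent for a density query can be sketched as follows (a toy one-dimensional tree with hypothetical field names; at each fork only the child cuboid containing x is visited, and x is assumed to lie inside the root cuboid):

```python
def query_density(node, x, n_total):
    """Descend the DET to the single final DE containing x (section 2.2.4).

    Interim nodes carry 'split_dim'/'split_at' and two 'children'; final
    DEs carry their bounds and sample count, from which the constant-DE
    density n(C_k) / (n * vol(C_k)) is returned.
    """
    while "children" in node:                  # fork: pick the side holding x
        left, right = node["children"]
        node = left if x[node["split_dim"]] < node["split_at"] else right
    vol = 1.0
    for lo, hi in node["bounds"]:
        vol *= hi - lo
    return node["n"] / (n_total * vol)

# Toy tree on [0, 1]: root split at 0.5; 30 samples left, 70 right (n = 100).
tree = {"split_dim": 0, "split_at": 0.5,
        "children": [{"bounds": [(0.0, 0.5)], "n": 30},
                     {"bounds": [(0.5, 1.0)], "n": 70}]}
print(query_density(tree, (0.25,), 100))  # 30/100 / 0.5 = 0.6
print(query_density(tree, (0.75,), 100))  # 70/100 / 0.5 = 1.4
```

The loop executes once per tree level, which is the origin of the query cost proportional to the tree depth n_t mentioned above.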
After of the order of n_t decision operations, where n_t is the number of DE splits or tree depth, the leaf or final DE k is reached that contains point x. Typically, n_t ≪ m < n, which renders the DET estimator computationally efficient for density queries. In the next few sections, we document the capabilities of the DET density estimator for a range of test cases.

3 Density Estimation with the DET Method

For the following computations, if not mentioned otherwise, significance levels α_g = α_d = 0.001 were applied. All computations were carried out on a state-of-the-art laptop computer with a 2.8 GHz Intel Core i7 processor. We inspect the accuracy of the DET estimator for a diverse set of one-, two-, four-, and seven-dimensional cases in the following three sections. To this end, we estimate the MISE defined as

    \left\langle \int_Ω [\hat{p}(x) − p(x)]^2 \, dx \right\rangle,   (8)

where \hat{p}(x) with x ∈ Ω, given for example by the DET method (1), is an estimate of the exact PDF p(x) based on an ensemble of samples, and the angular brackets represent the expectation with respect to that ensemble. To put our results into perspective with state-of-the-art density estimation, we compare against the adaptive KDE method of Botev et al (2010), the density-estimation-tree or in short density-tree method of Ram and Gray (2011), the LL-OPT estimator of Jiang et al (2016), and finally a conventional histogram, where the bin width in each dimension was determined based on the normal reference rule (Scott 2015, equation (3.66)).

To numerically evaluate the integration over probability space Ω in the MISE (8), we applied for d = 1 and 2 trapezoidal rules with equidistant Cartesian grids. As an exception to this, an exponentially stretched grid was applied for the one-dimensional gamma PDF example (following in section 3.1.6) to better resolve the region of high probability near the origin.
To assert the accuracy of the resulting MISE, adaptive quadrature methods as outlined by Shampine (2008b,a) were applied as well, leading to virtually the same results as with trapezoidal rules. In connection with the DET estimator, here the integration region Ω was decomposed among all DEs and adaptive quadrature was applied in each element individually. For the four- and seven-dimensional cases, quadrature or trapezoidal rules become too expensive and we resorted to Monte Carlo (MC) integration instead. To this end, we rewrite the MISE (8) as

    \left\langle \int_Ω \left[ \frac{\hat{p}(x)^2}{p(x)} − 2\hat{p}(x) + p(x) \right] p(x) \, dx \right\rangle = \left\langle \lim_{n→∞} \frac{1}{n} \sum_{j=1}^{n} \left[ \frac{\hat{p}(x_j)^2}{p(x_j)} − 2\hat{p}(x_j) + p(x_j) \right] \right\rangle,   (9)

with samples x_j distributed as prescribed by the PDF p(x).

Computationally efficient spectral KDE implementations for d = 1 and 2, which operate on equidistant Cartesian grids, are available (Botev 2007). Since these implementations become prohibitive for d > 2, an alternative was recently provided by Botev (2016). However, here, similar to conventional KDE, the computational effort per density evaluation depends on n. A density-tree implementation is available via the MLPACK library (Curtin et al 2013). For the LL-OPT computations, the implementation cited by Jiang et al (2016, section 7) was applied. We used the outlined implementations with prescribed standard parameter values.

In the following one- and two-dimensional examples, MISE ensemble averages ⟨...⟩ were approximated based on 50 samples. Moreover, for the DET method, based on these samples, average numbers of DEs ⟨m⟩ and tree depths ⟨n_t⟩ are reported. Here, n_t for one DET sample corresponds to the maximal depth within the tree. We start our assessment of the DET method by comparing in the following sections 3.1 and 3.2 the different DET variants to the established KDE method of Botev et al (2010).
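The MC form (9) follows from multiplying and dividing the integrand of (8) by p(x), so that the integral becomes an expectation over samples drawn from p. A minimal sketch (the estimator and density below are placeholders, not the estimators compared in this paper):

```python
import math
import random

def mc_ise(p_hat, p, sampler, n=200_000):
    """MC estimate of the integrated square error, the inner sum of (9):
    average of p_hat(x)^2 / p(x) - 2 p_hat(x) + p(x) over x ~ p."""
    acc = 0.0
    for _ in range(n):
        x = sampler()
        ph = p_hat(x)
        acc += ph * ph / p(x) - 2.0 * ph + p(x)
    return acc / n

# Placeholder example: true density p = N(0, 1), crude estimate
# p_hat = uniform density on [-3, 3].
random.seed(1)
p = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
p_hat = lambda x: 1.0 / 6.0 if -3.0 <= x <= 3.0 else 0.0
ise = mc_ise(p_hat, p, lambda: random.gauss(0.0, 1.0))
print(ise)  # close to the exact ISE of about 0.116 for this pair
```

Averaging such ISE values over an ensemble of estimator realizations then yields the outer expectation in (9), i.e., the MISE.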
In a next set of comparisons contained in section 3.3, we focus on linear DETs with equal-size splits and compare with density trees, LL-OPTs, and histograms.

3.1 One-Dimensional Examples

If not stated otherwise, for the following one-dimensional adaptive KDE, a grid with 2^14 = 16384 nodes or cosine modes was deployed (Botev 2007). The following one-dimensional examples involving normal mixture densities were taken from (Marron and Wand 1992).

3.1.1 Kurtotic Unimodal PDF

First, we inspect the performance of the DET estimator for the kurtotic unimodal distribution no. 4 in their paper, i.e.,

p(x) = (2/3) N(0, 1) + (1/3) N(0, 1/10), (10)

where N(μ, σ) is the normal distribution with mean μ and standard deviation σ. Exemplary estimates resulting from the DET method and KDE are depicted together with expression (10) in Fig. 1. The DET method adaptively allocates finer bins in the region where part N(0, 1/10) dominates and resorts to larger DEs to represent N(0, 1). The corresponding MISEs are depicted in Fig. 2(a) as a function of the number of samples n. A power-law decay with exponents between −1/2 and −1 is observed for all DET estimators and the adaptive KDE. The linear DET methods and adaptive KDE share approximately the same decay rate, with KDE having a smaller MISE than the DET variants.

Besides the inferior MISE decay rate, constant DEs require more DEs, as is seen in Fig. 2(b). Here, the mean numbers of constant and linear DEs grow approximately with n^{1/4} and n^{1/8}, respectively, for large n. The number of DEs ⟨m⟩ determines the computational cost associated with the DET construction. These growth rates are independent of the splitting method. While the DET splitting method has little effect on ⟨m⟩ and the MISE, the tree growth, measured by ⟨n_t⟩, is smaller for size- vs. score-based splitting (see Fig. 2(c)).
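The kurtotic unimodal mixture (10) is simple to evaluate and sample with the standard library alone; the following sketch (function names are this sketch's own) makes the test case concrete.

```python
# Sketch: evaluating and sampling the kurtotic unimodal mixture (10),
# p(x) = (2/3) N(0, 1) + (1/3) N(0, 1/10), with sigma = 1/10 = 0.1.
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kurtotic_pdf(x):
    return (2.0 / 3.0) * normal_pdf(x, 0.0, 1.0) + (1.0 / 3.0) * normal_pdf(x, 0.0, 0.1)

def kurtotic_sample(rng=random):
    # pick a mixture component proportionally to its weight, then draw from it
    if rng.random() < 2.0 / 3.0:
        return rng.gauss(0.0, 1.0)
    return rng.gauss(0.0, 0.1)
```

The narrow component N(0, 1/10) produces the sharp central peak around x = 0 that forces the adaptive refinement described above.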
However, in both cases, ⟨n_t⟩ increases logarithmically with n to a good approximation.

With m and n_t being discrete variables and the DE splitting being a discrete process, the DET datasets in Fig. 2 are not as smooth as, for example, the MISE decay of the adaptive KDE. Deviations from regular scalings in the present and, more so, in the next examples are especially apparent for small n, where the DETs are comprised of few splitting levels and DEs. In Fig. 2, exemplary 95% confidence intervals (given a Gaussian likelihood model) illustrate that the ensembles of density estimator realizations used in this work are sufficiently large to keep statistical uncertainties small.

3.1.2 Outlier PDF

In a next step, we study the DET estimator for the outlier distribution no. 5 from (Marron and Wand 1992) given by

p(x) = (1/10) N(0, 1) + (9/10) N(0, 1/10). (11)

The PDF estimates and resulting MISE are plotted in Fig. 3. Except for the slightly worse MISE of the DET variants compared with KDE, all previous observations from the kurtotic unimodal distribution (10) carry over to the present case.

Fig. 1 PDF estimates resulting from (blue thin solid) adaptive KDE and (red thick solid) the size-split DET method with particle ensembles including (1) n = 100 and (2) 10^4 samples are compared with (black dash-dot) the reference PDF (10). In panels (a) and (b), DET estimates with constant and linear elements are depicted, respectively.

3.1.3 Asymmetric Claw PDF

The third test distribution is the asymmetric claw distribution no. 12 (Marron and Wand 1992), i.e.,

p(x) = \frac{1}{2} N(0, 1) + \sum_{l=-2}^{2} \frac{2^{1-l}}{31} N\!\left(l + \frac{1}{2}, \frac{2^{-l}}{10}\right). (12)

Estimates of this PDF based on the size-based DET estimators and adaptive KDE are included in Fig. 4(a) and illustrate the adaptive bin-width selection of the DET method based on the local sample density. The MISE decay given in panel (b) is slightly closer to the adaptive KDE compared with the previous cases. Again, a logarithmic growth of the tree depth as a function of the ensemble size n is recovered, with depths similar to the previous examples, as seen in Fig. 2(c).

3.1.4 Spiky Uniforms PDF

The previous test cases have focused on Gaussian mixtures. The next three examples deal with different PDFs like the spiky uniforms distribution taken from (Wong and Ma 2010, example 6) and given by

p(x) = (1/2) U(0.23, 0.232) + (1/2) U(0.233, 0.235), (13)

where U(a, b) is a uniform distribution defined on the interval [a, b]. DET-based density estimates and adaptive KDE are compared in Fig. 5. We can observe that while the DET methods tend to produce oscillations at the bounds of the uniforms, the adaptive KDE is subject to noise apparent in the constant sections of the uniforms. This difficulty is detected as well by the MISE plotted in Fig. 6(a), where the DET methods converge faster, with approximately 1/n, to the true density compared to adaptive KDE with 1/√n. The DET methods become more accurate than KDE for ensembles with n > 10^3. The mean tree depth of the DET method shown in Fig. 6(b) and the mean number of DEs (not shown) display logarithmic and sublinear growth as observed previously. As an exception to all one-dimensional cases considered, for the spiky uniforms distribution, there is no advantage in terms of MISE convergence, number of DEs, and tree depth when using linear vs. constant DEs. Given the piecewise constant form of the PDF, this is expected.

Fig. 2 Estimation of kurtotic unimodal PDF (10). Evolutions of (a) the MISE and (d) the mean Hellinger distance as a function of the number of samples n for DET estimators (red solid and blue dashed) with equal-size (symbols) and score splits (no symbols) are plotted. The MISE and mean Hellinger distance resulting from the adaptive KDE are included as well (black dash-dot). For the DET estimators, the resulting mean numbers of DEs ⟨m⟩ and mean tree depths ⟨n_t⟩ are given in panels (b) and (c), respectively. Power-law and logarithmic scalings with exponents indicated in the figure legends are depicted (gray thick solid). Exemplary 95% confidence intervals are provided for the linear DET estimator with size splits.

Fig. 3 (a) PDF estimates resulting from (blue thin solid) adaptive KDE and (red thick solid) the size-split linear DET method based on a particle ensemble with n = 10^6 samples are compared with (black dash-dot) the reference PDF (11). The corresponding MISE evolution as a function of the number of samples n is shown in panel (b) like in Fig. 2(a).

Fig. 4 (a) PDF estimates resulting from (blue thin solid) adaptive KDE and (red thick solid) size-split (1) constant and (2) linear DETs based on particle ensembles with n = 10^4 samples are compared with (black dash-dot) the reference PDF (12). The corresponding MISE evolution and mean tree depth as a function of the number of samples n are shown in panels (b) and (c) like in Fig. 2.

3.1.5 Beta PDF

In our fifth example, we perform density estimation based on data derived from the beta PDF

p(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} (14)

with x ∈ [0, 1], parameters α = 1.05 and β = 0.8, and where Γ(x) is the gamma function. Exemplary resulting estimates are depicted in Fig. 7. While in the flat region of PDF (14) the constant DET estimator allocates few large elements, increasingly fine bins are placed towards the peak at x = 1. The linear DET estimator behaves similarly but requires fewer elements. In comparison with adaptive KDE, one can observe in Fig. 8(a) that the DET methods converge faster and that the linear DET variants are more accurate. Unlike in the previous examples, it becomes apparent from Fig. 8(b) that the score-split DET variants lead to smaller trees than their size-split counterparts.

3.1.6 Gamma PDF

The last one-dimensional example considered in this work is the gamma PDF

p(x) = \frac{x^{a-1} e^{-x/b}}{b^{a} \Gamma(a)} (15)

with x going from 0 to ∞ and parameters a = 2/3 and b = 50. As is visible from Fig. 9, this PDF is skewed and goes to infinity for x → 0. The MISE results included in the figure document the accuracy of the DET estimators. The ability of the adaptive KDE to resolve the singularity of PDF (15) becomes limited for n ≈ 10^{4.5} by the number of grid points or cosine modes applied for its calculation. Increasing this number from 2^14 to 2^17 = 131072 shifts the resolution limit to larger n, as is shown in Fig. 9(b).
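The beta (14) and gamma (15) densities can be written directly with `math.gamma`; the sketch below (parameter defaults taken from the text, function names this sketch's own) makes the boundary singularities explicit: (14) diverges at x = 1 for β < 1 and (15) diverges at x = 0 for a < 1, which is what drives the fine-bin refinement near those boundaries.

```python
# Sketch: the beta PDF (14) with alpha = 1.05, beta = 0.8 and the
# gamma PDF (15) with a = 2/3, b = 50, both via math.gamma.
import math

def beta_pdf(x, alpha=1.05, beta=0.8):
    c = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return c * x ** (alpha - 1.0) * (1.0 - x) ** (beta - 1.0)

def gamma_pdf(x, a=2.0 / 3.0, b=50.0):
    return x ** (a - 1.0) * math.exp(-x / b) / (b ** a * math.gamma(a))
```

Evaluating either function on a geometric sequence of points approaching the singular boundary shows the unbounded growth that a fixed-width grid or histogram cannot resolve.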
For the gamma PDF, as for the beta PDF, score-based splitting is slightly more effective than size-based splitting, leading to smaller trees with fewer DEs.

Fig. 5 PDF estimates resulting from (blue thin solid) adaptive KDE and (red thick solid) the size-split DET method with particle ensembles including (1) n = 100 and (2) 10^5 samples are compared with (black dash-dot) the reference PDF (13). In panels (a) and (b), DET estimates with constant and linear elements are depicted, respectively.

Fig. 6 Estimation of spiky uniforms PDF (13). (a) Evolutions of the MISE as a function of the number of samples n for DET estimators (red solid and blue dashed) with equal-size (symbols) and score splits (no symbols) are plotted. The MISE resulting from the adaptive KDE is included as well (black dash-dot). For the DET estimators, the resulting mean tree depths ⟨n_t⟩ are given in panel (b). Power-law and logarithmic scalings with exponents indicated in the figure legends are depicted (gray thick solid).

Fig. 7 PDF estimates resulting from (blue thin solid) adaptive KDE and (red thick solid) the size-split DET method with particle ensembles including (1) n = 10^4 and (2) 10^6 samples are compared with (black dash-dot) the beta PDF (14). In panels (a) and (b), DET estimates with constant and linear elements are depicted, respectively.

Fig. 8 Estimation of beta PDF (14). See Fig. 6.

Fig. 9 (a) PDF estimates resulting from (blue thin solid) adaptive KDE and (red thick solid) the size-split linear DET method based on a particle ensemble with n = 10^3 samples are compared with (black dash-dot) the reference PDF (15). The corresponding MISE evolution as a function of the number of samples n is shown in panel (b) like in Fig. 2(a). In addition to the adaptive KDE with 2^14 grid points (black dash-dot), MISE results based on 2^17 points (black dash-dot with symbol) are provided.

Fig. 10 PDF estimates resulting from adaptive KDE and the size-split linear DET method based on particle ensembles with (a) n = 10^4 and (b) n ≈ 10^{6.5} samples are compared with the reference PDF (16).

3.2 Two-Dimensional Examples

For the adaptive KDEs in the following two-dimensional cases, a grid with 2^10 × 2^10 = 1048576 nodes was deployed (Botev 2007).

3.2.1 Bi-Variate Gaussian PDF

Before looking at more complex bi-variate distributions, we consider the joint Gaussian PDF

p(\mathbf{x}) = \frac{\exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} C^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]}{\sqrt{(2\pi)^2 \det(C)}} (16)
with unbounded probability space \mathbf{x} = (x_1, x_2)^{\top}, mean vector \boldsymbol{\mu} = (0, 0)^{\top}, and covariance matrix

C = \begin{pmatrix} 4.0 & -2.28 \\ -2.28 & 1.44 \end{pmatrix}.

Gaussian PDFs are important in many applications, which emphasizes the relevance of the present case. Resulting size-split linear DET estimates are compared with KDE and the joint PDF (16) in Fig. 10. The adaptivity of the DET method is illustrated in panel (a), where regions of small density variation are represented by large DEs and sections with high variation were subdivided into several smaller elements.²

² In the singular case of linearly dependent components x_1 and x_2, many small DEs, resolving the probability peak along the diagonal of the x_1-x_2 space, result from a DET estimator.

The decay of the MISE depicted in Fig. 11(a) is similar to the one-dimensional Gaussian mixture cases, but the convergence rates are slightly smaller (compare for example with Fig. 2). Again, adaptive KDE is more accurate than the DET variants, and the DE tree growth is logarithmic, as seen in Fig. 11(c). Moreover, for large n, the mean number of DEs increases sublinearly with approximately n^{1/4} (see Fig. 11(b)).

Fig. 11 Estimation of bi-variate Gaussian PDF (16). See Fig. 2. In panel (c), the data series from the score-split DETs coincide.

Fig. 10(a) reveals a weakness of the present DET method in comparison to KDE, as the DET estimators or DET parametrizations depend on the orientation of the coordinate system.
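The orientation dependence noted above is commonly removed by rotating the data into the eigenbasis of its sample covariance. A minimal numpy sketch of such a principal axes transform follows; it is an illustration, not the paper's implementation, and it reuses the covariance matrix of example (16) purely as test data.

```python
# Sketch of a covariance-based principal axes transform: rotate centered
# samples onto the orthonormal eigenvectors of their sample covariance, so
# that a coordinate-aligned estimator sees decorrelated components.
import numpy as np

def principal_axes_transform(samples):
    """samples: (n, d) array; returns rotated samples with diagonal covariance."""
    centered = samples - samples.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues ascending, columns orthonormal
    return centered @ eigvecs          # project onto the principal axes

rng = np.random.default_rng(1)
C = np.array([[4.0, -2.28], [-2.28, 1.44]])      # covariance from example (16)
x = rng.multivariate_normal([0.0, 0.0], C, size=5000)
y = principal_axes_transform(x)
```

Because the rotation is built from the sample covariance itself, the covariance of the transformed samples is diagonal up to floating-point round-off.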
In case of alignment of the joint Gaussian PDF (16) with the x_1-x_2 coordinate system, the DET estimators will require fewer DEs and will be more accurate. This suggests the use of a covariance-matrix-based principal axes transform (e.g., Silverman 1998, equation (4.7)) to achieve parametrization invariance and increased accuracy and computational efficiency. An illustration is given in Fig. 12, where results from DET estimators combined with principal axes transforms are provided. However, in order to demonstrate the versatility of the DET method, this approach is not pursued further in the present work.

Fig. 12 Estimation of bi-variate Gaussian PDF (16). (a) PDF estimate resulting from the linear size-split DET method combined with a principal axes transform and based on a particle ensemble with n = 10^4 samples. (b) Evolutions of the corresponding MISE for the indicated DET estimators and KDE as a function of the number of samples n. All DET variants were combined with a principal axes transform.

3.2.2 Uniform on an Ellipse

Next, the two-dimensional case with uniform density on an ellipse outlined in (Botev et al 2010, p. 2942) is inspected. The PDF is given by

p(\mathbf{x}) = \begin{cases} 1/\pi & \forall\, \mathbf{x} \in \{\mathbf{x} \,|\, x_1^2 + (4 x_2)^2 \le 4\} \\ 0 & \text{otherwise} \end{cases} (17)

and in Fig. 13, KDE and DET-based estimation results based on ensembles stemming from this PDF are provided. Like in the one-dimensional case with uniform PDFs (see Fig. 5(a2) and (b2)), constant DEs lead to smaller oscillations at the interface where PDF (17) switches from 0 to 1/π (compare ranges of color bars in panels (1) and (2) of Fig. 13). However, while the adaptive KDE displays a noisy density estimate within the ellipse, the DET variants capture the constant density quite accurately.
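The indicator density (17) is straightforward to evaluate and to sample by rejection from its bounding box; the sketch below (function names this sketch's own, not the paper's code) makes the test case reproducible. The ellipse has semi-axes 2 and 1/2, hence area π and uniform density 1/π.

```python
# Sketch: the uniform-on-an-ellipse density (17) and a rejection sampler
# from the bounding box [-2, 2] x [-0.5, 0.5].
import math
import random

def ellipse_pdf(x1, x2):
    return 1.0 / math.pi if x1 ** 2 + (4.0 * x2) ** 2 <= 4.0 else 0.0

def ellipse_sample(rng=random):
    while True:  # accept a box sample only if it lands inside the ellipse
        x1 = rng.uniform(-2.0, 2.0)
        x2 = rng.uniform(-0.5, 0.5)
        if ellipse_pdf(x1, x2) > 0.0:
            return (x1, x2)
```

The discontinuous jump from 0 to 1/π at the ellipse boundary is exactly the feature on which constant and linear DEs differ in the discussion above.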
This is reflected in the MISE results shown in Fig. 14(a), where for n > 10^5 the DET estimators converge faster to the true PDF and become more accurate compared with adaptive KDE. Given the piecewise constant PDF (17), linear DEs are as good as constant elements, and like the MISE, the mean number of DEs ⟨m⟩ shown in Fig. 14(b) and the mean tree depth ⟨n_t⟩ (not shown) behave similarly for both DE types. For large n, ⟨m⟩ increases approximately as √n, and ⟨n_t⟩ (not shown) grows similarly as in the one-dimensional spiky uniforms case (see Fig. 6(b)).

One might argue that, in the absence of a principal axes transform, the present setup, with the ellipse aligned with the coordinate system, favors the DET method. In an additional study, the performance of the DET method was inspected for the ellipse rotated by an angle of π/4. With this modified setup, similar results were found, with the DET variants surpassing KDE at n ≈ 2 × 10^5 and with ⟨m⟩ starting from 6 elements at n = 100 and growing to similar numbers as in the unrotated case.

3.2.3 Bi-Variate Dirichlet PDF

The third bi-variate case addressed in this work is the Dirichlet PDF given by

p(\mathbf{x}) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)} x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} (1 - x_1 - x_2)^{\alpha_3 - 1} (18)

for all \mathbf{x} \in \{\mathbf{x} \,|\, x_1 \ge 0,\; x_2 \ge 0,\; x_1 + x_2 \le 1\} and p(\mathbf{x}) = 0 otherwise, with parameters α_1 = 0.9, α_2 = 1.5, and α_3 = 3. The Dirichlet PDF (18) is a bi-variate generalization of the uni-variate beta PDF (14). PDF (18) is depicted together with KDE results and DET estimates for two differently sized ensembles in Fig. 15. The DET method provides for both ensembles estimates that are in good agreement with the true density. MISE results are compared in Fig. 16(a), and it is seen that, like in the beta PDF example, the linear DET variants are most accurate, followed by their constant counterparts and adaptive KDE. As seen in Fig. 16(b), the size-split DET methods produce the smallest trees, with the linear DET approach being the most efficient. The mean number of DEs ⟨m⟩ (not shown) increases with n like in the bi-variate Gaussian case. The linear DEs and size-based splitting have the smallest ⟨m⟩, and all DET variants scale approximately with n^{1/4} for large n (see Fig. 11(b)).

Rotating the bi-variate Dirichlet PDF by π/4 has a bigger impact on the DET performance than in the previously discussed ellipse case. While the MISE decay rates remain similar, the MISE of the DET variants is larger than that of KDE for small n and surpasses KDE for n between 10^4 and 10^5.

Fig. 13 PDF estimates resulting from adaptive KDE and size-split DET methods with (a) constant and (b) linear DEs based on particle ensembles with (1) n = 10^4 and (2) n ≈ 10^{6.5} samples are compared with the reference PDF (17).

In addition to the previous MISE-based performance analyses, we have evaluated the accuracy of the DET estimator with Hellinger (e.g., Wang and Wang 2015) and total variation distances (Shorack 2000, p. 543). These metrics are based on differences of square roots of densities and absolute density differences, respectively, and therefore penalize strong deviations less than the MISE. Since occasional outliers are more likely in the DET estimator and regular small noise is an issue in the adaptive KDE method (see for example Fig. 5(b2)), the DET estimator was found to perform better than KDE in terms of the Hellinger-distance metric, as opposed to the MISE (see exemplary comparison in Fig. 2, panels (a) and (d)).
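A one-dimensional Hellinger distance of the kind used above can be sketched with a trapezoidal rule on a grid; the function name and grid choice below are this sketch's own, a stand-in for the cited Wang and Wang (2015) treatment.

```python
# Sketch: Hellinger distance H with H^2 = (1/2) * integral (sqrt(q) - sqrt(p))^2 dx,
# evaluated by the trapezoidal rule on an equidistant grid over [a, b].
import math

def hellinger(p, q, a, b, n=20001):
    h = (b - a) / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * h
        v = (math.sqrt(p(x)) - math.sqrt(q(x))) ** 2
        w = 0.5 if i in (0, n - 1) else 1.0   # trapezoidal end-point weights
        total += w * v
    return math.sqrt(0.5 * h * total)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
```

Because the square-root difference is bounded, a few large pointwise outliers contribute far less to this metric than to the MISE, which is the mechanism behind the relative ranking reported above.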
In terms of total variation distance, the relative performance among the estimators and the convergence trends were found to be overall similar to the MISE results.

Fig. 14 Estimation of the bi-variate uniform PDF (17). (a) Evolutions of the MISE as a function of the number of samples n for DET estimators (red solid and blue dashed) with equal-size (symbols) and score splits (no symbols) are plotted. The MISE resulting from the adaptive KDE is included as well (black dash-dot). For the DET estimators, the resulting mean numbers of DEs ⟨m⟩ are given in panel (b). Power-law scalings with exponents indicated in the figure legends are depicted (gray thick solid).

3.3 Comparisons with Histogram and Tree-Based Methods

In the previous two sections, we found that the linear DET variant is most accurate, with little difference between size- and score-based splitting. Accordingly, we proceed in this and the next sections by focusing on linear DETs with size-based splits. In Fig. 17, a summary of MISE decay curves resulting from the histogram, LL-OPT, density tree, and linear DET estimators for all one- and two-dimensional examples is provided. Due to the rapidly growing computational costs associated with the density-tree estimator, we stop after n = 31623 ≈ 10^{4.5}. The computing time for one density-tree estimate of, e.g., the beta PDF example with 10^5 samples took around 10 minutes. Fig. 17 is accompanied by exemplary PDF estimates included in Fig. 18. These estimates resulted from ensembles with n = 3163 samples.

One general observation from the MISE curves in Fig. 17 is the good accuracy of the DET estimator.
While histograms with bin widths determined by the normal reference rule are reasonably accurate in the Gaussian mixture cases, they are inaccurate in the other cases. The density-tree estimator leads to mixed results and is most accurate for the examples involving uniform distributions (see Fig. 18(b)). A MISE convergence that comes quite close to the DET estimator results from the LL-OPT method. Notable, finally, is a reduction in MISE decay rates when going from one to two dimensions, as seen in Fig. 17. This is an indication of the curse of dimensionality mentioned in the introduction.

3.4 Four- and Seven-Dimensional Examples

Next, we assess the performance of the linear DET method with equal-size splits together with the other estimators in examples involving four and seven sample-space dimensions.

3.4.1 High-Dimensional Gaussian PDF

In a first multi-dimensional example, we reuse the joint Gaussian PDF (16) with d-dimensional probability space \mathbf{x} = (x_1, x_2, \ldots, x_d)^{\top}, mean vector \boldsymbol{\mu} being zero, and d × d covariance matrix C. In a first step, we inspect the performance of the different methods for d = 4 with the randomly chosen covariance matrix

C = \begin{pmatrix} 1 & -0.344 & 0.141 & -0.486 \\ -0.344 & 1 & 0.586 & 0.244 \\ 0.141 & 0.586 & 1 & -0.544 \\ -0.486 & 0.244 & -0.544 & 1 \end{pmatrix}.

In Fig. 19(a), MISE decays from the different estimators are compared. KDE performs best, followed by the DET estimator, which shares the same empirically determined convergence rate of approximately 1/√n. The other estimators perform similarly to histograms. The data series of the density tree and KDE methods stop at n ≈ 10^{4.5} and 10^{5.5}, respectively, due to the rapidly growing computing times of these estimators (see discussion in the introduction, section 1).

Fig. 15 PDF estimates resulting from adaptive KDE and the size-split linear DET method based on particle ensembles with (a) n = 10^4 and (b) n ≈ 10^{6.5} samples are compared with the reference PDF (18).

An analysis of computing times per density estimate is provided in Fig. 19(b). The reported times are comprised of the construction of the estimator and density queries based on the estimator. The number of query points in the MC integration is kept constant for different estimators and numbers of samples n. This explains the plateaus for small n in the LL-OPT, density tree, and histogram estimators, where the computing times for small n are governed by the query effort. For increasing n, the estimator construction becomes noticeable, leading to growing times. In the KDE, the query cost is connected to the number of samples, which leads to a continuous increase in computing time. The relatively small growth in computing time of the DET estimator is based on two factors documented in Fig. 20: (panel a) the query time, which is driven by the logarithmically increasing tree depth, grows slowly compared to (panel b) the tree construction time, which depends on the power-law but sublinear growth in DEs ⟨m⟩. The construction time is plotted in Fig. 19(b) and shown to converge to the total DET computing time, as the query time becomes comparably small for large n.

In a next step, the dimensionality is increased to seven with the covariance matrix

C = \begin{pmatrix} 1 & -0.216 & 0.161 & -0.0496 & 0.0342 & -0.116 & 0.749 \\ -0.216 & 1 & 0.301 & 0.0391 & -0.217 & 0.0189 & -0.381 \\ 0.161 & 0.301 & 1 & 0.574 & -0.312 & 0.109 & 0.386 \\ -0.0496 & 0.0391 & 0.574 & 1 & -0.438 & 0.730 & -0.0572 \\ 0.0342 & -0.217 & -0.312 & -0.438 & 1 & -0.475 & 0.258 \\ -0.116 & 0.0189 & 0.109 & 0.730 & -0.475 & 1 & -0.386 \\ 0.749 & -0.381 & 0.386 & -0.0572 & 0.258 & -0.386 & 1 \end{pmatrix},

again chosen randomly. The corresponding MISE and computing-time curves for this example are provided in Fig. 21. Compared to the previous four-dimensional case, the MISE convergence rate shown in panel (a) has reduced to n^{-1/4} for the DET estimator (curse of dimensionality). Unlike in the previous case, the DET estimator becomes more accurate than adaptive KDE for n > 5000. While the LL-OPT estimator has a MISE convergence rate for large n similar to the DET estimator, the density-tree and histogram MISEs are similar to the previous four-dimensional case. The computing times reported in Fig. 21(b) show similar trends as well. Unlike in the four-dimensional case, however, we can identify the growth originating from the tree construction process at large n more clearly. Moreover, the computing times have increased noticeably compared to d = 4, which is documented as well in Fig. 20, where mean tree depth and number of DEs are higher for larger dimensionality d.

Fig. 16 Estimation of Dirichlet PDF (18). See Fig. 6.

3.4.2 High-Dimensional Dirichlet PDF

A d-dimensional generalization of the Dirichlet PDF (18) is given by

p(\mathbf{x}) = \frac{\Gamma\!\left(\sum_{i=1}^{d+1} \alpha_i\right)}{\prod_{i=1}^{d+1} \Gamma(\alpha_i)} \prod_{i=1}^{d} x_i^{\alpha_i - 1} \left(1 - \sum_{i=1}^{d} x_i\right)^{\alpha_{d+1} - 1} (19)

for all x_i ≥ 0 with \sum_{i=1}^{d} x_i \le 1, and with parameter vector \boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{d+1}) having components α_i > 0. Again, two examples of different dimensionality are considered, that is, a four-dimensional case with \boldsymbol{\alpha} = (6.13, 9.29, 10.6, 8.24, 3.91)^{\top} and a seven-dimensional case with \boldsymbol{\alpha} = (9, 5.71, 8.96, 4.51, 5.81, 4.06, 10.7, 1.51)^{\top}. Both parameter vectors were chosen randomly.
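The d-dimensional Dirichlet density (19) can be evaluated stably with log-gamma functions; the following sketch (its own naming, not the paper's code) takes the d free coordinates x and the d+1 parameters α, and returns 0 outside the simplex.

```python
# Sketch: the d-dimensional Dirichlet density (19) via math.lgamma for
# numerical stability; x has d entries, alpha has d + 1 entries.
import math

def dirichlet_pdf(x, alpha):
    rest = 1.0 - sum(x)                    # the implicit (d+1)-th coordinate
    if rest < 0.0 or any(xi < 0.0 for xi in x):
        return 0.0                         # outside the probability simplex
    log_p = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    for xi, ai in zip(x, alpha):           # pairs x_i with alpha_i, i = 1..d
        log_p += (ai - 1.0) * math.log(xi)
    log_p += (alpha[-1] - 1.0) * math.log(rest)
    return math.exp(log_p)
```

For α = (1, 1, 1) and d = 2, the density is the constant Γ(3) = 2 on the simplex, which gives a quick sanity check.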
MISE convergence curves for both cases are plotted in Fig. 22. Except for somewhat smaller asymptotic decay rates, the observations from the previous Gaussian examples carry over to this case. This holds also true for the computing times of the different estimators (not shown), which display similar dependencies on n as seen in the previous section. Finally, by inspecting the mean tree depth and number of DEs in Fig. 20, we can observe that they again grow approximately logarithmically and sublinearly, respectively, with a dependence on the dimension.

3.5 Dependence on Test Parameters and Test Statistic

All previous results were obtained with DET methods that were based on χ² test statistics, both for goodness-of-fit and independence, with identical significance levels α_g = α_d = 0.001. To inspect the influence of the choice of the test statistic on one hand, we performed DET computations with KS and Kendall's τ tests for goodness-of-fit and independence, respectively, while keeping the significance levels unchanged. On the other hand, we varied the significance level α = α_g = α_d by factors of 0.1 and 10, while maintaining the χ² test statistics. No significant changes in the DET estimator performance were found with these variations, as is for example shown in Fig. 23, where MISE results for the one-dimensional beta and seven-dimensional Dirichlet cases are provided.

Fig. 17 MISE decay as a function of the number of samples n for the indicated one- and two-dimensional cases.
Results from (blue solid) histogram, (magen ta dashed) LL-OPT, (blac k thic k solid) density tree, and (red symbols) size-split linear DET estimators are shown together with (gray thick solid) p o wer-la w scalings n − 0 . 25 , n − 0 . 5 , and n − 1 . -3 -2 -1 0 1 2 3 x 0 0.1 0.2 0.3 0.4 0.5 p(x) histogram LL-OPT dens. tree ref. (a) 0.229 0.23 0.231 0.232 0.233 0.234 0.235 0.236 x 0 50 100 150 200 250 300 350 p(x) histogram LL-OPT dens. tree ref. (b) Fig. 18 PDF estimates resulting from (blue thin solid) histogram, (magenta dashed) LL-OPT, and (black thick solid) density tree estimators with n = 3163 samples are compared with (black dash dot) the reference PDFs (12) and (13). In panels (a) and (b), the asymmetric claw (12) and spiky uniforms (13) PDFs, resp ectively , are depicted. Distribution Element T rees 21 10 2 10 4 10 6 n 10 -3 10 -2 MISE histogram LL-OPT dens. tree DET KDE n -0.125,-0.25,-0.5 (a) 10 2 10 4 10 6 n 10 -2 10 0 10 2 average computation time [s] histogram LL-OPT dens. tree DET KDE DET t.c. (b) Fig. 19 Estimation of the four-dimensional join t Gaussian PDF. (a) Evolutions of the MISE and (b) computing time as a function of the num ber of samples n for histograms (blue solid), LL-OPT (dashed magenta), density tree (black thick solid), and DET (red solid symbols) estimators, and adaptive KDE (black dash dot) are plotted. In panel (b), the tree construction time for the DET estimator (red dashed symbol) is plotted as well. 10 2 10 4 10 6 n 0 5 10 15 20 25 30 35 mean tree depth n t Gauss 4d Gauss 7d Dirichlet 4d Dirichlet 7d (a) 10 2 10 4 10 6 n 10 0 10 5 mean number of DEs m Gauss 4d Gauss 7d Dirichlet 4d Dirichlet 7d (b) Fig. 20 Mean tree depths h n t i (a) and mean n um b er of DEs h m i (b) of the DET estimator in the examples in v olving the (lines) four- and (symbols) seven-dimensional (blue dashed) Gaussian and (red solid) Dirichlet PDFs. 10 2 10 4 10 6 n 4 6 8 10 12 14 16 MISE 10 -4 histogram LL-OPT dens. 
tree DET KDE n -0.0625,-0.125,-0.25 (a) 10 2 10 4 10 6 n 10 -2 10 0 10 2 average computation time [s] histogram LL-OPT dens. tree DET KDE DET t.c. (b) Fig. 21 Estimation of the seven-dimensional joint Gaussian PDF. See Fig. 19. 22 Daniel W. Meyer 10 2 10 4 10 6 n 10 2 MISE histogram LL-OPT dens. tree DET KDE n -0.125,-0.25,-0.5 (a) 10 2 10 4 10 6 n 6 8 10 12 MISE 10 5 histogram LL-OPT dens. tree DET KDE n -0.0625,-0.125,-0.25 (b) Fig. 22 Estimation of the (a) four- and (b) seven-dimensional Dirichlet PDFs. Evolutions of the MISE as a function of the n umber of samples n are plotted. See Fig. 19. 10 2 10 4 10 6 n 10 -4 10 -3 10 -2 MISE 2 = 10 -3 2 = 10 -2 2 = 10 -4 KS n -0.25,-0.5,-1 (a) 10 2 10 4 10 6 n 6 8 10 12 MISE 10 5 2 = 10 -3 2 = 10 -2 2 = 10 -4 KS+Kendall's n -0.0625,-0.125,-0.25 (b) Fig. 23 Estimations of (a) beta PDF (14) and (b) seven-dimensional Diric hlet PDF (19). Ev olutions of the MISE as a function of the num ber of samples n for linear DET estimators with equal size splits and χ 2 test statistics with significance level (red solid symbol) α = 0 . 001, (blue dashed symbol) α = 0 . 01, (blue dashed) α = 0 . 0001, and (red solid) KS and Kendall’s τ test statistics with α = 0 . 001 are plotted. Po wer law and logarithmic scalings with exponents indicated in the figure legends are depicted (gray thick solid). By v arying, how ev er, the significance level α , a de- p endence of the mean num ber of DEs is exp ected as is seen in Fig. 24. While an increase in α leads to an in- crease in the n um b er of elemen ts, more elemen ts do not necessarily translate into higher accuracy or reduced MISE. This is due to a bias/v ariance trade-off, where as α is increased bias/statistical errors de-/increase, re- sp ectiv ely , and vice versa. As a result, the total error or MISE do es not change significantly . 
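To make the role of these test parameters concrete, the following minimal sketch (Python with NumPy/SciPy; the published implementation is in MATLAB, and the function name and interface here are hypothetical) shows how a refinement decision for a single element can be assembled from a goodness-of-fit test per marginal and pairwise independence tests, here using the KS and Kendall's τ variants discussed above:

```python
import numpy as np
from scipy import stats

def needs_refinement(x, alpha_g=1e-3, alpha_d=1e-3):
    """Decide whether a distribution element holding the samples x
    (an n-by-d array, rescaled to the element's unit box) should be
    split. Goodness-of-fit: KS test of each marginal against the
    uniform law (the constant-DE hypothesis). Independence: Kendall's
    tau test for every coordinate pair, i.e., pairwise independence
    as a proxy for mutual independence."""
    n, d = x.shape
    for i in range(d):  # goodness-of-fit test per marginal
        if stats.kstest(x[:, i], 'uniform').pvalue < alpha_g:
            return True
    for i in range(d):  # pairwise independence tests
        for j in range(i + 1, d):
            if stats.kendalltau(x[:, i], x[:, j]).pvalue < alpha_d:
                return True
    return False

grid = np.linspace(0.0, 1.0, 500)
# Perfectly dependent coordinates: the tau test forces a split.
print(needs_refinement(np.column_stack([grid, grid])))  # True
# Independent uniform samples: typically no split is warranted.
u = np.random.default_rng(0).uniform(size=(2000, 2))
print(needs_refinement(u))
```

Raising α_g = α_d makes both tests fire more readily and hence increases the number of elements, matching the trend seen in Fig. 24; the χ² statistics used by default in the paper would replace the two SciPy calls with binned counterparts.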
4 Concluding Remarks

The DET estimator provides an analytical density representation based on piecewise constant, linear, and possibly higher-order functions. This representation can be efficiently assembled and evaluated at arbitrary probability-space positions. Therefore, the adaptivity of the DET method is not limited by a smallest scale, which limits, for example, the performance of grid-based estimation methods. Unlike adaptive KDE, DET-based estimates provide limited differentiability, which may be a disadvantage in some applications. Moreover, unlike KDE combined with a principal-axes transform (e.g., Silverman 1998, equation (4.7)), the DET estimator is not invariant to the orientation of the coordinate system. A corresponding combination was found to enable an increase in the computational efficiency and accuracy of the DET method.

Fig. 24 Estimations of (a) the beta PDF (14) and (b) the seven-dimensional Dirichlet PDF (19). Evolutions of the mean number of DEs ⟨m⟩ are depicted. See Fig. 23.

In terms of accuracy, the linear DET method showed, for the different examples considered, similar or better MISE convergence rates compared with adaptive KDE. The latter is, besides the DET method, the most accurate estimator in this study. Constant DEs are advantageous only in cases where the samples stem from piecewise constant PDFs, as they lead to smaller oscillations at PDF discontinuities. The LL-OPT estimator follows the linear DET estimator in terms of accuracy. Conventional histograms are least accurate, but at the same time computationally inexpensive as long as the dimension is not too high.
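The constant and linear DEs compared above differ only in the local density model stored per element. As an illustration, a single linear DE can be evaluated as a product of linear marginal factors over its bounding box; this is a sketch assuming the marginal form p(x | θ) = (x − 1/2)θ + 1 of Appendix A applied per rescaled coordinate, with a hypothetical function name and box layout:

```python
import numpy as np

def linear_de_density(x, lo, hi, theta):
    """Density of one linear distribution element on the box [lo, hi]:
    each coordinate is rescaled to [0, 1] and contributes a linear
    marginal factor (x~ - 1/2) * theta_i + 1 (a sketch of the form
    used in Appendix A), normalized by the box volume. Slopes theta_i
    in [-2, 2] keep every factor nonnegative."""
    x, lo, hi, theta = map(np.asarray, (x, lo, hi, theta))
    xt = (x - lo) / (hi - lo)  # rescale to the unit box
    if np.any(xt < 0) or np.any(xt > 1):
        return 0.0  # outside this element
    return float(np.prod((xt - 0.5) * theta + 1.0) / np.prod(hi - lo))

# Box [0,1] x [0,2]; at the box center every factor equals one, so
# the density is 1/volume = 0.5 regardless of the slopes.
print(linear_de_density([0.5, 1.0], [0, 0], [1, 2], [2.0, -1.0]))   # 0.5
print(linear_de_density([0.75, 0.5], [0, 0], [1, 2], [2.0, -1.0]))  # 0.9375
```

Since each marginal factor integrates to one over [0, 1], the element's density integrates to one over its box; a constant DE is recovered by setting all slopes to zero.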
Density trees were found to provide mixed results in terms of accuracy.

Computing times of the DET method were found to scale in terms of the number of samples favorably compared to the other estimators. In the DET method, the number of DEs, which determines the time for the DET construction, was found to grow sublinearly. In the case of score-based splitting, there is a theoretical linear upper bound to the DE growth. The splitting method for the DET construction was found to have little effect on the number of DEs, but it affects the tree depth. However, irrespective of the method applied, the mean tree depth scales approximately logarithmically with the number of samples. The computational cost of a density evaluation at a specific point is determined by the tree depth. By varying the test parameters that guide the splitting process, i.e., significance levels and test statistics, no significant change in the DET performance was found.

In conclusion, our new DET method is a good candidate for a computationally efficient general-purpose density estimator. The MATLAB implementation of the DET method that was applied in this study is available for download from the MathWorks File Exchange (tag 'distribution element tree').

An important element that allowed us to break the exponential growth of the bin count with respect to the dimension d is the approximation of (mutual) independence by pairwise independence. The relation between pairwise and mutual independence is a subject of ongoing research (e.g., Nelsen and Ubeda-Flores 2012). Besides this important aspect, to further the DET method, it would be interesting to inspect the performance of higher-order DEs and more advanced goodness-of-fit and statistical independence tests.
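The logarithmic depth scaling matters because a point evaluation costs one descent through the tree. The following minimal sketch (Python; a hypothetical node layout with constant leaf DEs, not the published MATLAB data structure) illustrates such a lookup:

```python
# Sketch of a DET-style density lookup with constant leaf elements
# (hypothetical node layout). Each internal node splits its box along
# one axis; each leaf stores a density value. The cost of evaluate()
# is proportional to the depth of the leaf containing x.

class Node:
    def __init__(self, density=None, axis=None, pos=None,
                 left=None, right=None):
        self.density = density            # set on leaves only
        self.axis, self.pos = axis, pos   # split axis and position
        self.left, self.right = left, right

def evaluate(node, x):
    """Descend from the root to the leaf containing x and return the
    leaf's constant density value."""
    while node.density is None:
        node = node.left if x[node.axis] < node.pos else node.right
    return node.density

# PDF on [0,1]^2 that integrates to one: density 1 on the left half,
# 0 on the lower-right quarter, and 2 on the upper-right quarter.
tree = Node(axis=0, pos=0.5,
            left=Node(density=1.0),
            right=Node(axis=1, pos=0.5,
                       left=Node(density=0.0),
                       right=Node(density=2.0)))
print(evaluate(tree, (0.25, 0.75)))  # 1.0
print(evaluate(tree, (0.75, 0.75)))  # 2.0
print(evaluate(tree, (0.75, 0.25)))  # 0.0
```

With a mean depth growing like log n, as reported above, the per-point evaluation cost stays modest even for large ensembles.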
A Derivation of MMSE Slope Estimator

Writing, without loss of generality, the linear marginal PDF (4) in a simpler form with $x_i \in [0, 1]$ and the subscripts skipped, we obtain

$$p(x \mid \theta) = \left(x - \tfrac{1}{2}\right)\theta + 1.$$

By calculating the mean of the random variable X based on this PDF, we obtain $\langle X\rangle = \tfrac{1}{12}(6 + \theta)$ and can therefore express the slope parameter in terms of this mean as

$$\theta = 6(2\langle X\rangle - 1). \qquad (20)$$

In the case of a finite ensemble, we estimate the mean with $\langle X\rangle_n = \tfrac{1}{n}\sum_{j=1}^{n} x_j$ and the slope by $\hat{\theta} = 6c\,(2\langle X\rangle_n - 1)$. Here, c is a correction factor that is determined by minimizing the mean square error (MSE) expressed as

$$\begin{aligned}
\langle(\hat{\theta}-\theta)^2\rangle
&= \Bigg\langle \Bigg[ 6c \Bigg( \frac{2}{n} \sum_{j=1}^{n} x_j - 1 \Bigg) - \theta \Bigg]^2 \Bigg\rangle \\
&= \Bigg\langle 36c^2 \Bigg( \frac{2}{n} \sum_{j=1}^{n} x_j - 1 \Bigg)^2
 - 12c \Bigg( \frac{2}{n} \sum_{j=1}^{n} x_j - 1 \Bigg)\theta + \theta^2 \Bigg\rangle \\
&= 36c^2 \Bigg\langle \frac{4}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} x_j x_k
 - \frac{4}{n} \sum_{j=1}^{n} x_j + 1 \Bigg\rangle
 - 12c(2\langle X\rangle - 1)\theta + \theta^2 \\
&= 36c^2 \Bigg( \frac{4}{n^2} \Bigg\langle \sum_{j=1}^{n} \sum_{\substack{k=1 \\ k \neq j}}^{n} x_j x_k
 + \sum_{j=1}^{n} x_j^2 \Bigg\rangle - 4\langle X\rangle + 1 \Bigg)
 - 12c(2\langle X\rangle - 1)\theta + \theta^2 \\
&= 36c^2 \Bigg[ \frac{4(n-1)}{n} \langle X\rangle^2 + \frac{4}{n} \langle X^2\rangle
 - 4\langle X\rangle + 1 \Bigg] - 12c(2\langle X\rangle - 1)\theta + \theta^2 .
\end{aligned}$$

To determine the minimum MSE, we set

$$\frac{\mathrm{d}}{\mathrm{d}c} \langle(\hat{\theta}-\theta)^2\rangle
= 72c \Bigg[ \frac{4(n-1)}{n} \langle X\rangle^2 + \frac{4}{n} \langle X^2\rangle - 4\langle X\rangle + 1 \Bigg]
- 12(2\langle X\rangle - 1)\theta = 0,$$

which leads to the correction factor

$$\begin{aligned}
c &= \frac{(2\langle X\rangle - 1)\, n \theta}{6\left[4(n-1)\langle X\rangle^2 + 4\langle X^2\rangle - 4n\langle X\rangle + n\right]} \qquad (21) \\
&= \frac{(2\langle X\rangle - 1)\, n \theta}{6\left[4n\langle X\rangle^2 - 4\langle X\rangle^2 + 4\langle X^2\rangle - 4n\langle X\rangle + n\right]} \\
&= \frac{(2\langle X\rangle - 1)\, n \theta}{6\left[n(2\langle X\rangle - 1)^2 + 4\langle X'^2\rangle\right]} \\
&= \frac{n(2\langle X\rangle - 1)^2}{n(2\langle X\rangle - 1)^2 + 4\langle X'^2\rangle},
\end{aligned}$$

where $\langle X'^2\rangle = \langle X^2\rangle - \langle X\rangle^2$ denotes the variance of X and θ was eliminated with (20) in the last step. For n → ∞, the correction factor c goes to one.

References

Achilleos A, Delaigle A (2012) Local bandwidth selectors for deconvolution kernel density estimation. Statistics and Computing 22(2):563–577
Bagnato L, Punzo A, Nicolis O (2012) The autodependogram: a graphical device to investigate serial dependences. Journal of Time Series Analysis 33(2):233–254
Botev Z (2007) Spectral implementation of adaptive kernel density estimator via diffusion. https://ch.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator, accessed: 01/09/2016
Botev Z (2016) Implementation of adaptive kernel density estimator for high dimensions via diffusion. https://ch.mathworks.com/matlabcentral/fileexchange/58312-kernel-density-estimator-for-high-dimensions, accessed: 10/01/2017
Botev ZI, Grotowski JF, Kroese DP (2010) Kernel density estimation via diffusion. The Annals of Statistics 38(5):2916–2957
Breiman L, Friedman J, Stone CJ, Olshen R (1984) Classification and Regression Trees. Wadsworth Statistics/Probability, Chapman and Hall/CRC
Cao R, Cuevas A, Gonzalez Manteiga W (1994) A comparative study of several smoothing methods in density estimation. Computational Statistics and Data Analysis 17(2):153–176
Cochran WG (1952) The chi square test of goodness of fit. The Annals of Mathematical Statistics 23(3):315–345
Curtin RR, Cline JR, Slagle NP, March WB, Ram P, Mehta NA, Gray AG (2013) MLPACK: A scalable C++ machine learning library. Journal of Machine Learning Research 14:801–805
Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1(2):209–230
Fix E, Hodges J (1951) Discriminatory analysis, nonparametric estimation: consistency properties. Report 4, Project No. 21-49-004, USAF School of Aviation Medicine
Haerdle W, Werwatz A, Mueller M, Sperlich S (2004) Nonparametric and Semiparametric Models, 1st edn. Springer Series in Statistics, Springer-Verlag Berlin Heidelberg
Jiang H, Mu JC, Yang K, Du C, Lu L, Wong WH (2016) Computational aspects of optional Pólya tree. Journal of Computational and Graphical Statistics 25(1):301–320
Jing J, Koch I, Naito K (2012) Polynomial histograms for multivariate density and mode estimation. Scandinavian Journal of Statistics 39(1):75–96
Jones MC, Marron JS, Sheather SJ (1996) A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91(433):401–407
Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1-2):81–93
Kogure A (1987) Asymptotically optimal cells for a histogram. The Annals of Statistics 15(3):1023–1030
Kooperberg C, Stone CJ (1991) A study of logspline density estimation. Computational Statistics and Data Analysis 12(3):327–347
Loftsgaarden DO, Quesenberry CP (1965) A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics 36(3):1049–1051
Ma L, Wong WH (2011) Coupling optional Pólya trees and the two sample problem. Journal of the American Statistical Association 106(496):1553–1565
Mann HB, Wald A (1942) On the choice of the number of class intervals in the application of the chi square test. The Annals of Mathematical Statistics 13(3):306–317
Marron JS, Wand MP (1992) Exact mean integrated squared error. The Annals of Statistics 20(2):712–736
Neal RM (2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9(2):249–265
Nelsen RB, Ubeda-Flores M (2012) How close are pairwise and mutual independence? Statistics and Probability Letters 82(10):1823–1828
O'Brien TA, Kashinath K, Cavanaugh NR, Collins WD, O'Brien JP (2016) A fast and objective multidimensional kernel density estimation method: fastKDE. Computational Statistics and Data Analysis 101:148–160
Papoulis A (1991) Probability, Random Variables, and Stochastic Processes, 3rd edn. McGraw-Hill Series in Electrical Engineering, McGraw-Hill, Inc., New York
Park B, Turlach B (1992) Practical performance of several data driven bandwidth selectors. Report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE)
Park BU, Marron JS (1990) Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association 85(409):66–72
Pearson K (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine 50:157–175
Petersen A, Muller HG (2016) Functional data analysis for density functions by transformation to a Hilbert space. The Annals of Statistics 44(1):183–218
Ram P, Gray AG (2011) Density estimation trees. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, pp 627–635
Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics 27(3):832–837
Scott DW (2015) Multivariate Density Estimation: Theory, Practice, and Visualization, 2nd edn. Wiley Series in Probability and Statistics, Wiley, Hoboken
Scott DW, Sagae M (1997) Adaptive density estimation with massive data sets. In: Proceedings of the Statistical Computing Section, ASA, American Statistical Association, pp 104–108
Shampine LF (2008a) MATLAB program for quadrature in 2D. Applied Mathematics and Computation 202(1):266–274
Shampine LF (2008b) Vectorized adaptive quadrature in MATLAB. Journal of Computational and Applied Mathematics 211(2):131–140
Sheather SJ (2004) Density estimation. Statistical Science 19(4):588–597
Shorack GR (2000) Probability for Statisticians. Springer Texts in Statistics, Springer, Berlin
Silverman BW (1998) Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability, Chapman and Hall/CRC, Boca Raton
Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics 19(2):279–281
Sriperumbudur B, Fukumizu K, Kumar R, Gretton A, Hyvaerinen A (2013) Density estimation in infinite dimensional exponential families. arXiv:1509.04348v2
Steele M, Chaseling J (2006) Powers of discrete goodness-of-fit test statistics for a uniform null against a selection of alternative distributions. Communications in Statistics - Simulation and Computation 35(4):1067–1075
Wang X, Wang Y (2015) Nonparametric multivariate density estimation using mixtures. Statistics and Computing 25(2):349–364
Wong WH, Ma L (2010) Optional Pólya tree and Bayesian inference. The Annals of Statistics 38(3):1433–1459
Yang Y (2008) Penalized semiparametric density estimation. Statistics and Computing 19(4):355
Zaunders J, Jing J, Leipold M, Maecker H, Kelleher AD, Koch I (2016) Computationally efficient multidimensional analysis of complex flow cytometry data using second order polynomial histograms. Cytometry Part A 89(1):44–58
