Generalized Mode and Ridge Estimation


Authors: Yen-Chi Chen, Christopher R. Genovese, Larry Wasserman

Yen-Chi Chen, Department of Statistics, Carnegie Mellon University (yenchic@andrew.cmu.edu)
Christopher R. Genovese, Department of Statistics, Carnegie Mellon University (genovese@stat.cmu.edu)
Larry Wasserman, Department of Statistics, Carnegie Mellon University (larry@stat.cmu.edu)

Abstract

The generalized density is a product of a density function and a weight function. For example, the average local brightness of an astronomical image is the probability of finding a galaxy times the mean brightness of the galaxy. We propose a method for studying the geometric structure of generalized densities. In particular, we show how to find the modes and ridges of a generalized density function using a modification of the mean shift algorithm and its variant, subspace constrained mean shift. Our method can be used to perform clustering and to calculate a measure of connectivity between clusters. We establish consistency and rates of convergence for our estimator and apply the methods to data from two astronomical problems.

1 Introduction

Consider a random sample of the form $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $X_1, \ldots, X_n \in \mathbb{R}^d$ is a random sample from a smooth density $p$ and each $Y_i$ is a scalar random variable. The generalized density function (GDF), also known as an intensity function, is $f(x) = \mu(x) p(x)$, where $\mu(x) = \mathbb{E}(Y \mid X = x)$. A kernel estimate of the GDF is

$$\hat{f}_n(x) = \frac{1}{nh^d} \sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right), \qquad (1)$$

where $K$ is a kernel function and $h$ is the smoothing bandwidth. For simplicity, we henceforth assume that $K$ is a Gaussian kernel. The generalized density function and its estimator are of interest in problems where the additional information in the $Y$'s measures the importance, relevance, or intensity of the corresponding points. (In the language of point processes, the $X$'s are the points and the $Y$'s are the marks.) Two common cases include:

1. $Y_i$ represents a covariate associated with each point, and the generalized density weights the points according to the value of the covariate. For example, in galactic astronomy, the $X$'s might represent a galaxy's location and the $Y$'s the galaxy's mass. Astronomers are interested in the "mass-density" $f$, which describes the distribution of galaxy mass.

2. $Y_i$ represents the measurement precision for each observation, so the generalized density weights the points according to how precisely they are measured.

Figure 1: An example of a smooth function with its modes (red) and ridge (blue). (a) Modes and ridges; (b) contour plot.

This paper focuses on estimating the modes and ridges of the GDF because (i) they are often features of direct scientific interest, (ii) they provide useful and descriptive summaries of the GDF's structure, and (iii) they can be used as inputs to clustering. Given a smooth function $f : \mathbb{R}^d \to \mathbb{R}$, the modes (local maxima) and ridges [5, 15, 8] are defined as

$$M = \mathrm{Mode}(f) = \{x \in \mathbb{R}^d : \nabla f(x) = 0, \ \lambda_1(x) < 0\}$$
$$R = \mathrm{Ridge}(f) = \{x \in \mathbb{R}^d : V(x)V(x)^T \nabla f(x) = 0, \ \lambda_2(x) < 0\}, \qquad (2)$$

where $\lambda_1(x) \ge \lambda_2(x) \ge \cdots \ge \lambda_d(x)$ are the eigenvalues of $\nabla\nabla f(x)$, the Hessian matrix of $f$, and $V(x) = [v_2(x), \ldots, v_d(x)]$ is a $d \times (d-1)$ matrix whose columns $v_k(x)$ are the eigenvectors of $\nabla\nabla f(x)$ corresponding to the eigenvalues $\lambda_k(x)$.

The problem of estimating modes and ridges for density functions has been considered in the literature. For instance, [16, 14, 1] develop estimators for local modes, and [8] develops estimators for density ridges and establishes the asymptotic properties of these estimators. The analogous problem for generalized density functions has not yet been considered. We estimate the modes and ridges by

$$\hat{M}_n = \mathrm{Mode}(\hat{f}_n), \qquad \hat{R}_n = \mathrm{Ridge}(\hat{f}_n). \qquad (3)$$
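The kernel estimate (1) is straightforward to compute directly. A minimal NumPy sketch (the function name and vectorization are ours, not the paper's; $K$ is the standard Gaussian kernel as assumed above):

```python
import numpy as np

def generalized_density(x, X, Y, h):
    """Kernel estimate of the generalized density f(x) = mu(x) p(x),
    i.e. equation (1): (1 / (n h^d)) * sum_i Y_i K((x - X_i) / h),
    with K the standard Gaussian kernel."""
    X = np.atleast_2d(np.asarray(X, dtype=float))   # (n, d) sample points
    n, d = X.shape
    u = (x - X) / h                                  # (n, d) scaled differences
    K = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2.0 * np.pi) ** (d / 2)
    return float(np.sum(Y * K)) / (n * h**d)
```

With $Y_i \equiv 1$ this reduces to the ordinary kernel density estimator $\hat{p}_n$; the estimate is linear in the marks $Y$, which is the property the weighted mean shift below exploits.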
It is well known that the local modes and ridges of the kernel density estimator can be found efficiently by the mean shift algorithm [7, 3, 4]. Here we present a modification of the mean shift algorithm that finds the modes and ridges of $\hat{f}_n$.

2 Methods

2.1 Weighted Mean Shift

Before we proceed to our method, we first review the usual mean shift algorithm. Given data $X_1, \ldots, X_n \in \mathbb{R}^d$ and an initial point $x \in \mathbb{R}^d$, the mean shift algorithm [7, 3, 4] updates $x$ to

$$x \leftarrow \frac{\sum_{i=1}^{n} X_i K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}. \qquad (4)$$

If we keep iterating, we end up at a mode of the kernel density estimator $\hat{p}_n(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$. We define the weighted mean shift, which generates a path starting from a point $x$ by successive updates of the form

$$x \leftarrow \frac{\sum_{i=1}^{n} Y_i X_i K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)}. \qquad (5)$$

This is directly analogous to the ordinary mean-shift update but puts additional weight $Y_i$ on the point $X_i$. The path generated from each point $x$ eventually converges to a local mode of the generalized density estimate $\hat{f}_n$, i.e. an element of the set $\hat{M}_n$. Later we will see that this is a consistent estimator of $M$.

Now we derive the relation between (1) and (5). The gradient of $\hat{f}_n(x)$ is

$$\nabla \hat{f}_n(x) = \frac{1}{nh^d} \sum_{i=1}^{n} Y_i \nabla K\left(\frac{x - X_i}{h}\right) = \frac{1}{nh^{d+2}} \sum_{i=1}^{n} Y_i (X_i - x) K\left(\frac{x - X_i}{h}\right)$$
$$= \frac{1}{nh^{d+2}} \left( \sum_{i=1}^{n} Y_i X_i K\left(\frac{x - X_i}{h}\right) \right) - \frac{x}{nh^{d+2}} \left( \sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right) \right). \qquad (6)$$

Note that we use the fact that $\nabla K(x) = -xK(x)$ for the Gaussian kernel. After rearrangement,

$$x + m(x) = \frac{\sum_{i=1}^{n} Y_i X_i K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)}, \qquad (7)$$

where

$$m(x) = \frac{nh^{d+2}}{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)} \, \nabla \hat{f}_n(x) \qquad (8)$$

is the mean shift vector, which always points in the direction of the gradient. Thus the update rule (5) moves $x$ to $x + m(x)$, following the gradient ascent.
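The weighted update (5) can be sketched as a simple fixed-point iteration; the stopping rule and tolerances below are our own choices, not part of the paper:

```python
import numpy as np

def weighted_mean_shift(x, X, Y, h, max_iter=500, tol=1e-8):
    """Iterate the weighted mean-shift update (5):
        x <- sum_i Y_i X_i K((x - X_i)/h) / sum_i Y_i K((x - X_i)/h),
    which ascends the gradient of the generalized density estimate.
    The Gaussian kernel's normalizing constant cancels and is omitted."""
    x = np.asarray(x, dtype=float)
    X = np.atleast_2d(np.asarray(X, dtype=float))
    for _ in range(max_iter):
        w = Y * np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1))  # Y_i K((x-X_i)/h)
        x_new = w @ X / w.sum()
        if np.linalg.norm(x_new - x) < tol:   # converged to a local mode
            break
        x = x_new
    return x
```

Starting points in the same basin of attraction flow to the same element of $\hat{M}_n$, which is what the clustering application below relies on.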
By Morse theory, for a smooth function with non-degenerate Hessian matrix, the gradient ascent path converges to one of the local modes. Accordingly, the update rule (5) ends up at one of the local modes.

2.2 Weighted Subspace Constrained Mean Shift

[15] proposes the subspace constrained mean shift algorithm, which can be used to find the ridges of $\hat{f}_n(x)$. An analogous modification also works in the weighted case to find the ridges of the generalized density function. This update rule, which we call the weighted subspace constrained mean shift algorithm, is given by

$$x \leftarrow x + V(x)V(x)^T m(x), \qquad (9)$$

where $m(x)$ is the mean shift vector defined in (8) and $V(x) = [v_2(x), \ldots, v_d(x)]$, with $v_k(x)$ the eigenvector corresponding to the $k$-th eigenvalue (the first being the largest) of the estimated Hessian matrix $\nabla\nabla \hat{f}_n(x)$. The Hessian matrix is

$$\nabla\nabla \hat{f}_n(x) = \frac{1}{nh^{d+4}} \sum_{i=1}^{n} \left[ (X_i - x)(X_i - x)^T - h^2 I_d \right] Y_i K\left(\frac{x - X_i}{h}\right), \qquad (10)$$

where $I_d$ is the $d \times d$ identity matrix. In practice we can ignore the factor $\frac{1}{nh^{d+4}}$, since it is just a scaling. This algorithm pushes every point $x$ along the projected gradient path $V(x)V(x)^T m(x)$ until it arrives at a point on $\hat{R}_n$.

3 Applications

3.1 Mode Clustering

A common application of the mean shift algorithm is to perform a type of clustering called mode clustering [12]. The clusters are defined as the sets of points whose mean shift paths converge to the same local mode. Using the weighted mean shift algorithm yields a mode clustering based on the generalized density estimate. Let $f(x)$ be a smooth intensity function. For any point $x$, we define a gradient path $\phi_x(t)$ as follows:

$$\phi_x : \mathbb{R}^+ \to \mathbb{R}^d, \qquad \phi_x(0) = x, \qquad \phi'_x(t) = \nabla f(\phi_x(t)). \qquad (11)$$

We denote $\mathrm{dest}(x) = \lim_{t \to \infty} \phi_x(t)$, the destination of $\phi_x$.
By Morse theory [10, 17], $\mathrm{dest}(x)$ must be one of the local modes of $f(x)$ except when $x$ lies in a set of Lebesgue measure 0 (which includes the saddle points and local minima). Let $M = \{M_1, \ldots, M_k\}$ be the collection of local modes of $f$. We define the cluster of $M_j$ by

$$C_j = \{x \in \mathbb{R}^d : \mathrm{dest}(x) = M_j\}. \qquad (12)$$

In practice, we cluster the data points by the destinations of their weighted mean shift paths (5).

3.2 Connectivity Measure for Clusters

Some clusters are fairly isolated while others are close together. Here we show how to measure how close clusters are by defining a notion of connectivity. The idea is that the mean shift iterations can be thought of as particles moving according to a diffusion (Markov chain). We define a diffusion as follows. The probability of jumping from $x$ to $X_i$ is

$$P(x \to X_i) = \frac{Y_i K\left(\frac{x - X_i}{h}\right)}{\sum_{l=1}^{n} Y_l K\left(\frac{x - X_l}{h}\right)}. \qquad (13)$$

This defines a diffusion; we denote by $Q(x)$ the random variable of this diffusion, i.e. $P(Q(x) = X_i) = P(x \to X_i)$. Now

$$\mathbb{E}(Q(x) \mid X_1, Y_1, \ldots, X_n, Y_n) = \frac{\sum_{i=1}^{n} X_i Y_i K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)}, \qquad (14)$$

which is the update rule (5). The same result holds for the non-weighted mean shift. Thus, the mean shift can be viewed as an expectation under a certain diffusion.

Let $\hat{M}_1, \ldots, \hat{M}_k$ be the local modes of $\hat{f}_n(x)$. Motivated by (13), we define

$$P(X_i \to X_j) = \frac{Y_j K\left(\frac{X_i - X_j}{h}\right)}{\sum_{l=1}^{n} Y_l K\left(\frac{X_i - X_l}{h}\right) + \sum_{m=1}^{k} W_m K\left(\frac{X_i - \hat{M}_m}{h}\right)}$$
$$P(X_i \to \hat{M}_j) = \frac{W_j K\left(\frac{X_i - \hat{M}_j}{h}\right)}{\sum_{l=1}^{n} Y_l K\left(\frac{X_i - X_l}{h}\right) + \sum_{m=1}^{k} W_m K\left(\frac{X_i - \hat{M}_m}{h}\right)}, \qquad (15)$$

where $W_j = \hat{f}_n(\hat{M}_j) \Big/ \left( \frac{1}{nh^d} \sum_{i=1}^{n} K\left(\frac{\hat{M}_j - X_i}{h}\right) \right)$. Note that we impute the weight $W_j$ for each local mode $\hat{M}_j$ by an estimate of $\mu(\hat{M}_j)$. We also define each mode to be an absorbing state; namely, the transition probability from each local mode to itself is 1.
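Mode clustering by weighted mean-shift destination (Section 3.1) can be sketched as follows; the mode-merging tolerance of $h/2$ is a heuristic of ours, not part of the paper:

```python
import numpy as np

def mode_clustering(X, Y, h, max_iter=500, tol=1e-6):
    """Cluster each data point by the destination of its weighted
    mean-shift path: points whose paths converge to the same local
    mode of the generalized density estimate share a label."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    modes, labels = [], []
    for x0 in X:
        x = x0.copy()
        for _ in range(max_iter):                      # weighted mean-shift, eq. (5)
            w = Y * np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1))
            x_new = w @ X / w.sum()
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        for j, m in enumerate(modes):                  # merge nearby destinations
            if np.linalg.norm(x - m) < 0.5 * h:        # heuristic merge radius
                labels.append(j)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return np.array(modes), np.array(labels)
```

The returned modes approximate $\hat{M}_n$ and the labels implement the clusters $C_j$ of (12), with destinations computed from the data points themselves.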
Let $P$ be a transition matrix with $k + n$ states such that the first $k$ states are the estimated local modes and the latter $n$ states are the data points. Then the transition matrix $P$ can be factorized as

$$P = \begin{pmatrix} I_k & 0 \\ S & T \end{pmatrix}, \qquad T_{ij} = P(X_i \to X_j), \qquad S_{ij} = P(X_i \to \hat{M}_j), \qquad (16)$$

where $I_k$ is the $k \times k$ identity matrix. Then the matrix $A = (I_n - T)^{-1} S$ is the absorbing matrix; that is, the absorbing probability from $X_i$ to the local mode $\hat{M}_j$ is $A_{ij}$. We define the connectivity of the two clusters corresponding to the local modes $M_i, M_j$ as

$$\Omega_{ij} = \frac{1}{2} \left( \frac{\sum_{X_l \in D_i} Y_l A_{lj}}{\sum_{X_l \in D_i} Y_l} + \frac{\sum_{X_l \in D_j} Y_l A_{li}}{\sum_{X_l \in D_j} Y_l} \right), \qquad (17)$$

where $D_l$ is the set of data points belonging to cluster $l$. The interpretation of $\Omega_{ij}$ is as follows: $\Omega_{ij}$ is the average probability that points in cluster $i$ hit mode $j$ first, and vice versa. The connectivity is large when two clusters are close and the boundary between them has high density. If we think of the (hard) cluster assignments as class labels, the connectivity is analogous to the misclassification rate between class $i$ and class $j$.

4 Statistical Analysis

The modes and ridges of $\hat{f}_n$ are estimates of the modes and ridges of $f$. In this section we study the statistical properties of these estimators. Let $BC^k$ be the set of bounded, $k$ times continuously differentiable functions, and let $\|A\|_{\max}$ be the max norm of a vector or matrix $A$. For a smooth function $f$, we define the following operators. A vector $\alpha = (\alpha_1, \ldots, \alpha_d)$ of non-negative integers is called a multi-index, with $|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d$ and corresponding derivative operator

$$D^\alpha = \frac{\partial^{\alpha_1}}{\partial x_1^{\alpha_1}} \cdots \frac{\partial^{\alpha_d}}{\partial x_d^{\alpha_d}},$$

where $D^\alpha f$ is often written $f^{(\alpha)}$. Namely, each $f^{(\alpha)}(x)$ is an $|\alpha|$-th order partial derivative of $f$.
For $j = 0, \ldots, 3$, define the following norms associated with the derivatives:

$$\|f\|_\infty^{(j)} = \max_{\alpha : |\alpha| = j} \sup_{x \in \mathbb{R}^d} |f^{(\alpha)}(x)|, \qquad \|f\|_{\infty,k}^* = \max_{j = 0, \ldots, k} \|f\|_\infty^{(j)}. \qquad (18)$$

Note that for two functions $f_1, f_2$, the quantity $\|f_1 - f_2\|_{\infty,k}^*$ is a norm that measures the differences between $f_1$ and $f_2$ up to the $k$-th order of differentiation.

General Assumptions.

(A1) The random variable $Y$ is uniformly bounded by a constant $C_Y < \infty$.

(A2) The function $f \in BC^3$ and is a Morse function (the Hessian matrix is non-singular at all critical values).

(K1) The kernel function $K \in BC^3$ and is symmetric, non-negative, and

$$\int x^2 K^{(\alpha)}(x)\, dx < \infty, \qquad \int \left( K^{(\alpha)}(x) \right)^2 dx < \infty \quad \text{for all } |\alpha| = 0, 1, 2, 3.$$

(K2) The kernel function satisfies condition $K1$ of [9]. That is, there exist some $A, v > 0$ such that for all $0 < \epsilon < 1$,

$$\sup_Q N\left(\mathcal{K}, L_2(Q), C_K \epsilon\right) \le \left( \frac{A}{\epsilon} \right)^v,$$

where $N(T, d, \epsilon)$ is the $\epsilon$-covering number for a semi-metric space $(T, d)$ and

$$\mathcal{K} = \left\{ u \mapsto K^{(\alpha)}\left(\frac{x - u}{h}\right) : x \in \mathbb{R}^d,\ h > 0,\ |\alpha| = 0, 1, 2, 3 \right\}.$$

Assumptions (A1-2) are mild regularity conditions. Condition (K1) is common for kernel functions, and (K2) is a smoothness condition on the kernel. In particular, the Gaussian kernel and any smooth kernel with compact support satisfy (K1-2).

4.1 Risk of $\hat{f}_n$

Define the mean integrated squared error (MISE) by

$$\mathrm{MISE}_k(\hat{f}_n) = \mathbb{E} \int \sum_{|\alpha| = k} \left| \hat{f}_n^{(\alpha)}(x) - f^{(\alpha)}(x) \right|^2 dx, \qquad (19)$$

for $k = 0, 1, 2, 3$. Note that for $k = 0$ we obtain the usual MISE for $\hat{f}_n$; (19) is just an extension to higher order derivatives.

Theorem 1. Assume (A1-2) and (K1). Then

$$\mathrm{MISE}_k(\hat{f}_n) = O(h^4) + O\left(\frac{1}{nh^{d+2k}}\right), \qquad k = 0, 1, 2, 3.$$

[2] proves the above theorem for usual kernel density estimation. We omit the proof as it is similar to theirs. We also have the following uniform bound.

Theorem 2. For a smooth function $f$, let $\|f\|_{\infty,k}^*$ be defined as above. Assume (A1-2) and (K1-2).
Then

$$\|\hat{f}_n - f\|_{\infty,k}^* = O(h^2) + O_P\left(\sqrt{\frac{\log n}{nh^{d+2k}}}\right), \qquad k = 0, 1, 2, 3.$$

The proof is essentially the same as in [9, 6], noting that the random variable $Y$ is bounded by $C_Y$. Similar results for kernel density estimation can be found in [9, 6].

4.2 Mode Estimation and Ridge Recovery

In this section we assume that the density is supported on a compact subset $\mathbb{K} \subset \mathbb{R}^d$. For two sets $A, B$, the Hausdorff distance is given by

$$d_H(A, B) = \inf\{r : A \subset B \oplus r,\ B \subset A \oplus r\}, \qquad (20)$$

where $A \oplus r = \{x : d(x, A) \le r\}$ and $d(x, A) = \inf_{y \in A} \|x - y\|_2$.

Theorem 3. Let $M$, $\hat{M}_n$ be the collections of local modes of $f$ and $\hat{f}_n$, respectively. Assume (A1-2), (K1-2), and

(M) there exist $\lambda_0, \lambda_1, \delta > 0$ such that $\{x : \lambda_1(x) < -\lambda_1,\ \|\nabla f(x)\|_2 < \lambda_0\} \subset M \oplus \delta$,

where $\lambda_1(x)$ is the first eigenvalue of the Hessian matrix of $f(x)$. When $\|\hat{f}_n - f\|_{\infty,2}^*$ is sufficiently small,

$$d_H\left(\hat{M}_n, M\right) = O(h^2) + O_P\left(\sqrt{\frac{1}{nh^{d+2}}}\right).$$

The proof of Theorem 3 is in the supplementary material. Here we present an outline of the proof.

PROOF OUTLINE. Let $\hat{M}_1, \ldots, \hat{M}_{\hat{k}}$ be the estimated local modes. Note that the number of estimated modes might differ from the number of true modes. However, by assumption (M) and the fact that $\|\hat{f}_n - f\|_{\infty,2}^*$ is sufficiently small, we have $\hat{M}_n \subset M \oplus \delta$ and $\hat{k} = k$. This follows from the bounds on the eigenvalues of the Hessian matrix and on the estimated gradient. Moreover, we cannot place two estimated modes near any true mode $M_i$, since there would then be a saddle point between the two (estimated) modes, which cannot happen by assumption (M) on the first eigenvalue. After rearranging the indices, each $\hat{M}_i$ is close to $M_i$ for all $i = 1, \ldots, k$. Note that $\nabla \hat{f}_n(\hat{M}_i) = 0$, so Taylor's theorem implies

$$\nabla \hat{f}_n(M_i) = \nabla \hat{f}_n(M_i) - \nabla \hat{f}_n(\hat{M}_i) = \nabla\nabla \hat{f}_n(M_i^*)(M_i - \hat{M}_i),$$

where $M_i^*$ is a point between $\hat{M}_i$ and $M_i$.
Thus, $M_i - \hat{M}_i = \left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1} \nabla \hat{f}_n(M_i)$. By techniques similar to those used to prove rates of convergence for the kernel density estimator, we have $\nabla \hat{f}_n(M_i) = O(h^2) + O_P\left(\sqrt{\frac{1}{nh^{d+2}}}\right)$, and $\left\| \left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1} \right\|_{\max}$ is uniformly bounded when $\|\hat{f}_n - f\|_{\infty,2}^*$ is small. We conclude $\|\hat{M}_i - M_i\|_2 = O(h^2) + O_P\left(\sqrt{\frac{1}{nh^{d+2}}}\right)$. □

The above theorem shows that every local mode of the intensity function $f(x)$ will be estimated by the weighted mean shift algorithm as long as $\|\hat{f}_n - f\|_{\infty,2}^*$ is sufficiently small. By Theorem 2, $\|\hat{f}_n - f\|_{\infty,2}^*$ converges to 0, so this result eventually holds. The rate comes from the bias-variance decomposition. Note that the variance term is $O_P\left(\sqrt{\frac{1}{nh^{d+2}}}\right)$ since the modes are defined via the gradient, and this is the variance in estimating gradients.

Theorem 4. Let $R$, $\hat{R}_n$ be the ridges of $f$ and $\hat{f}_n$, respectively. Assume (A1-2), (K1-2), and the following two conditions:

(R1) There exist $\beta, \delta > 0$ and $b_1 < b_2$ with $b_1 < 0$ and $\beta = b_2 - b_1$ such that, for all $x \in R \oplus \delta$,

$$\lambda_1(x) > b_2, \qquad \lambda_2(x) < b_1, \qquad \|\nabla f(x)\|_2 \max_{|\alpha| = 3} |f^{(\alpha)}(x)| < \frac{\beta^2}{2\sqrt{d}}.$$

(R2) There exist $G_0, G_1 > 0$ such that $\{x : \|V(x)V(x)^T \nabla f(x)\|_{\max} \le G_0,\ \lambda_2(x) < G_1\} \subset R \oplus \delta$, where $\delta$ is defined in (R1).

When $\|\hat{f}_n - f\|_{\infty,3}^*$ is sufficiently small,

$$d_H\left(\hat{R}_n, R\right) = O(h^2) + O_P\left(\sqrt{\frac{\log n}{nh^{d+4}}}\right).$$

PROOF OUTLINE. The proof is essentially the same as in [8]. In particular, our conditions (A1-2) together with (R1) imply the conditions in [8]. We assume the additional condition (R2) so that, when $\|\hat{f}_n - f\|_{\infty,2}^*$ is sufficiently small, the ridge estimator satisfies $\hat{R}_n \subset R \oplus \delta$. This is because $\hat{f}_n$ and its first and second derivatives will be close to those of $f$ with high probability, and thus the eigenvalues and projected gradient $V(x)V(x)^T \nabla f(x)$ for the estimator will be similar to the truth.
Hence, by Theorems 6 and 7 in [8], the result follows. □

5 Applications

We apply our method to two astronomy problems. A detailed description of the data can be found in the supplementary material.

Figure 2: The SDSS data. Each black dot is a galaxy; the size of the dot is proportional to the mass of the galaxy. (a) Original data; (b) modes and ridges.

The first problem analyzes data from the Sloan Digital Sky Survey (SDSS), where each data point is a galaxy at a specific location. The original data set is three-dimensional, but to ease visualization, we take a thin slice of the data to transform it into a two-dimensional problem. For each galaxy, we also have information on the galaxy's total luminosity, which can be transformed into a proxy for the galaxy's mass. We weight each galaxy by its mass, estimate the mass-density field, and find the modes (red crosses) and ridges (blue curves). In cosmology, the modes correspond to galaxy clusters and the ridges correspond to filaments. Figure 2 shows the results of our analysis.

The second problem is to identify galaxies in a single image containing several galaxies. The data set is a noisy image of four overlapping galaxies. We first transform the image into gray scale and convolve it with a Gaussian kernel with bandwidth h = 4 to reconstruct the GDF. We focus only on the region with intensity larger than 0.15. We apply the weighted mean shift algorithm, perform clustering to segment the data, and construct the connectivity measure for the clusters. Figure 3 shows the results. We successfully detect the four galaxies, and the weighted mean shift clustering classifies the regions belonging to each galaxy. The connectivity matrix captures the level of "interaction" among the four galaxies. For instance, clusters 1 and 2 have high connectivity, reflecting their large overlap. In contrast, clusters 2 and 4 have only moderate connectivity.
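The connectivity computation of Section 3.2 — transition probabilities as in (15), the absorbing matrix $A = (I_n - T)^{-1}S$ from (16), and $\Omega$ from (17) — can be sketched as below. This is a simplification of the paper's construction: the mode weights $W_j$ are passed in directly rather than computed as $\hat{f}_n(\hat{M}_j)/\hat{p}_n(\hat{M}_j)$:

```python
import numpy as np

def connectivity(X, Y, modes, W, labels, h):
    """Connectivity Omega between mode clusters: build the substochastic
    point-to-point matrix T and point-to-mode matrix S (modes absorb),
    solve A = (I - T)^{-1} S, and average hitting probabilities per (17)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    modes = np.atleast_2d(np.asarray(modes, dtype=float))
    n, k = len(X), len(modes)

    def kern(a, B):  # Gaussian kernel values K((a - B_j)/h), constant omitted
        return np.exp(-0.5 * np.sum(((a - B) / h) ** 2, axis=1))

    T = np.zeros((n, n))
    S = np.zeros((n, k))
    for i in range(n):
        to_pts = Y * kern(X[i], X)          # Y_j K((X_i - X_j)/h)
        to_modes = W * kern(X[i], modes)    # W_j K((X_i - M_j)/h)
        Z = to_pts.sum() + to_modes.sum()   # common normalizer of (15)
        T[i] = to_pts / Z
        S[i] = to_modes / Z

    A = np.linalg.solve(np.eye(n) - T, S)   # absorbing probabilities, eq. (16)

    Omega = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            Di, Dj = labels == i, labels == j
            Omega[i, j] = 0.5 * (Y[Di] @ A[Di, j] / Y[Di].sum()
                                 + Y[Dj] @ A[Dj, i] / Y[Dj].sum())
    return Omega
```

For well-separated clusters the off-diagonal entries of $\Omega$ are near 0 while within-cluster absorption is near 1, matching the interpretation of $\Omega_{ij}$ as a soft misclassification rate.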
6 Conclusion

In this paper, we generalized mode and ridge estimation from densities to generalized densities that can account for weighting or "marks." We have established convergence rates for estimating modes and ridges in this setting. Future work will focus on constructing confidence sets for the estimated modes and ridges.

Figure 3: The galaxy merger. (a) Original image; (b) modes and clustering; (c) matrix of connectivity measures:

        1      2      3      4
  1     -    0.25   0.10   0.10
  2   0.25     -    0.13   0.21
  3   0.10   0.13     -    0.14
  4   0.10   0.21   0.14     -

A Proof for Theorem 3

PROOF. The proof consists of two steps. In the first step, we prove that the numbers of elements of $M$ and $\hat{M}$ are the same and that each element of $M$ corresponds to a close element of $\hat{M}$. The second step is to prove the rate of convergence. We assume

$$\|\hat{f}_n - f\|_{\infty,2}^* \le \min\left\{ \frac{\lambda_1}{2d}, \frac{\lambda_0}{2} \right\} \qquad (21)$$

(this is what we mean by "$\|\hat{f}_n - f\|_{\infty,2}^*$ is sufficiently small"). Note that by Weyl's theorem (Theorem 4.3.1 in [11]), (21) implies that the difference between the eigenvalues of $\nabla\nabla f(x)$ and those of its estimator $\nabla\nabla \hat{f}_n(x)$ is bounded by $\frac{1}{2}\lambda_1$. Thus, the eigenvalues of $\nabla\nabla \hat{f}_n(x)$ within $M \oplus \delta$ are upper bounded by $-\frac{1}{2}\lambda_1$. We will use this fact later.

Step 1. Let $\hat{M}_1, \ldots, \hat{M}_{\hat{k}}$ be the estimated local modes. Note that the number of estimated modes might differ from the number of true modes. However, by assumption (M) and (21), we have $\hat{M}_n \subset M \oplus \delta$. This can be proved by contradiction, since any local mode of $\hat{f}_n$ must have a negative first eigenvalue and zero gradient; equation (21) and assumption (M) force these points to be within $M \oplus \delta$. Moreover, we have $\hat{k} = k$. Note that we cannot place two estimated modes near any true mode $M_i$, since there would then be a saddle point between the two (estimated) modes, which cannot happen by assumption (M) on the first eigenvalue (saddle points have a positive first eigenvalue).

Step 2.
This part of the proof is similar to [16]; however, our problem is simpler than theirs, since we only need to find the rate of convergence while they prove limiting distributions. (By similar techniques, one could prove the limiting distribution as well.) After rearranging the indices, each $\hat{M}_i$ is close to $M_i$ for all $i = 1, \ldots, k$. Note that $\nabla \hat{f}_n(\hat{M}_i) = 0$, so Taylor's theorem implies

$$\nabla \hat{f}_n(M_i) = \nabla \hat{f}_n(M_i) - \nabla \hat{f}_n(\hat{M}_i) = \nabla\nabla \hat{f}_n(M_i^*)(M_i - \hat{M}_i),$$

where $M_i^*$ is a point between $\hat{M}_i$ and $M_i$. Thus,

$$M_i - \hat{M}_i = \left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1} \nabla \hat{f}_n(M_i).$$

The next step is to prove

$$\mathbb{E}\left( \|M_i - \hat{M}_i\|_2 \right) = O(h^2), \qquad \mathrm{Var}\left( \|M_i - \hat{M}_i\|_2 \right) = O\left( \frac{1}{nh^{d+2}} \right).$$

We bound $M_i - \hat{M}_i$ as follows:

$$\|M_i - \hat{M}_i\|_2 = \left\| \left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1} \nabla \hat{f}_n(M_i) \right\|_2 \le \left\| \left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1} \right\|_2 \left\| \nabla \hat{f}_n(M_i) \right\|_2.$$

We first bound $\left\| \left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1} \right\|_2$. Note that the $L_2$ matrix norm is the largest absolute eigenvalue; thus, all we need to do is bound the eigenvalues of $\left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1}$. Since $M_i^*$ is a point between $M_i$ and $\hat{M}_i$, we have $M_i^* \in M \oplus \delta$. Consequently, all eigenvalues of $\nabla\nabla \hat{f}_n(M_i^*)$ must be less than or equal to $-\frac{1}{2}\lambda_1$ by assumption (M) and (21). Therefore, the eigenvalues of $\left( \nabla\nabla \hat{f}_n(M_i^*) \right)^{-1}$ must be bounded by $\frac{2}{\lambda_1}$. This gives the bound on the $L_2$ norm.

Now we find the rate of $\left\| \nabla \hat{f}_n(M_i) \right\|_2$. Since $\nabla f(M_i) = 0$, $\mathbb{E}\left( \nabla \hat{f}_n(M_i) \right) = \mathbb{E}\left( \nabla \hat{f}_n(M_i) \right) - \nabla f(M_i)$ is the bias of $\nabla \hat{f}_n(M_i)$. By assumptions (A1-2) and (K1), the bias is of rate $O(h^2)$, in the same way as for the bias of the kernel density estimator. Similarly, the covariance matrix is of rate $O\left( \frac{1}{nh^{d+2}} \right)$. Hence, by the multidimensional Chebyshev inequality,

$$\nabla \hat{f}_n(M_i) - \mathbb{E}\left( \nabla \hat{f}_n(M_i) \right) = O_P\left( \sqrt{\frac{1}{nh^{d+2}}} \right),$$

which implies $\nabla \hat{f}_n(M_i) = O(h^2) + O_P\left( \sqrt{\frac{1}{nh^{d+2}}} \right)$.
Putting it all together,

$$\|M_i - \hat{M}_i\|_2 = O(h^2) + O_P\left( \sqrt{\frac{1}{nh^{d+2}}} \right).$$

By repeating Step 2 for each mode, we conclude the result. □

B Proof for Theorem 4

PROOF. The proof is essentially the same as the proof of Theorem 6 in [8]. In particular, our conditions (A1-2) together with (R1) imply all the conditions in [8]. Note that they prove that the set $\tilde{R} \equiv \hat{R}_n \cap (R \oplus \delta)$ is close to $R$ at rate

$$d_H(\tilde{R}, R) = O\left( \|\hat{f}_n - f\|_{\infty,2}^* \right)$$

when $\|\hat{f}_n - f\|_{\infty,3}^*$ is sufficiently small.

Here we prove that our additional assumption (R2) implies $\tilde{R} = \hat{R}$. Let $x \in \hat{R}$. Then we have $\hat{V}_n(x)\hat{V}_n(x)^T \nabla \hat{f}_n(x) = 0$. Thus,

$$\|V(x)V(x)^T \nabla f(x)\|_{\max} = \|V(x)V(x)^T \nabla f(x) - \hat{V}_n(x)\hat{V}_n(x)^T \nabla \hat{f}_n(x)\|_{\max}$$
$$\le \left\|\left( V(x)V(x)^T - \hat{V}_n(x)\hat{V}_n(x)^T \right) \nabla f(x)\right\|_{\max} + \left\|\hat{V}_n(x)\hat{V}_n(x)^T \left( \nabla f(x) - \nabla \hat{f}_n(x) \right)\right\|_{\max}$$
$$\le \|V(x)V(x)^T - \hat{V}_n(x)\hat{V}_n(x)^T\|_{\max}\, C_1 + \|\nabla f(x) - \nabla \hat{f}_n(x)\|_{\max}\, C_2,$$

where $C_1, C_2$ are two constants independent of $x$. The existence of $C_1, C_2$ comes from the fact that $\nabla f(x)$ and $\hat{V}_n(x)\hat{V}_n(x)^T$ are uniformly bounded for all $x \in R$; note that since $\hat{V}_n \hat{V}_n^T$ is a projection matrix, its max norm is uniformly bounded by 1. The first term $\|V(x)V(x)^T - \hat{V}_n(x)\hat{V}_n(x)^T\|_{\max}$ can be bounded via the Davis-Kahan theorem [13] at rate $\|\hat{f}_n - f\|_{\infty,2}^*$, and the second term is of rate $\|\hat{f}_n - f\|_{\infty,1}^*$. Accordingly, when $\|\hat{f}_n - f\|_{\infty,3}^*$ is small, any point $x \in \hat{R}$ satisfies $\|V(x)V(x)^T \nabla f(x)\|_{\max} \le G_0$ and $\lambda_2(x) < G_1$. Thus, $\hat{R} \subset R \oplus \delta$, so that $\tilde{R} = \hat{R}$. Now by Theorem 6 in [8], $d_H(\hat{R}, R) = O\left( \|\hat{f}_n - f\|_{\infty,2}^* \right)$, and by Theorem 2, we conclude the result. □

C Preprocessing and Description of the Data

C.1 The SDSS Data

The data from the Sloan Digital Sky Survey (SDSS) can be found at http://www.sdss3.org/.
We use the Main Sample Galaxies (MSGs) spectroscopic data in Data Release 9. Each observation is a galaxy containing the following four features:

1. RA = right ascension (i.e., longitude, in degrees [0 to 360])
2. DEC = declination (i.e., latitude, in degrees [-90 to 90])
3. z = spectroscopic redshift (a precisely estimated proxy for distance)
4. r = Petrosian magnitude, r band (for completeness only)

The first three features (RA, DEC, z) relate to the spatial position of the galaxy, and the Petrosian magnitude is a luminosity measure associated with the mass. We select a thin slice of the universe to analyze:

$$\mathrm{RA} \in [155, 185], \qquad \mathrm{DEC} \in [35, 65], \qquad z \in [0.110, 0.115].$$

The mass of a galaxy is given by the following formula (http://www.sdss.org/DR7/tutorials/getdata/index.html):

$$\mathrm{MASS} = r - 5 \log(4.28 \times 10^8 \times z),$$

where r is the Petrosian magnitude and z is the redshift. We use MASS as $Y$ and the spatial location $X = (\mathrm{RA}, \mathrm{DEC})$ to conduct our analysis. Note that since we pick a thin slice, we can neglect the z dimension.

C.2 The Galaxy Merger

The galaxy merger image is taken from Galaxy Zoo (http://quenchtalk.galaxyzoo.org/#/subjects/AGS000007y), with label 'AGS000007y'. For the image data, the statistical model is given by

$$Y_i = g(X_i) + \epsilon_i, \qquad i = 1, \ldots, n,$$

where the $\epsilon_i$'s are IID mean-0 noise variables and the $X_i$'s lie on a uniform grid. Since the $X_i$'s are on a uniform grid, the density function is constant, so the function $g$ is the same as the GDF. Then the estimator

$$\hat{g}_n(x) = \frac{1}{nh^d} \sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)$$

is an estimate of $g(x)$, and all the analysis for the GDF can be applied to the image data.

References

[1] C. Abraham, G. Biau, and B. Cadre. Simple estimation of the mode of a multivariate density. The Canadian Journal of Statistics, 2003.
[2] J. Chacón, T. Duong, and M. Wand. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica, 2011.
[3] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790-799, 1995.
[4] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.
[5] D. Eberly. Ridges in Image and Data Analysis. Springer, 1996.
[6] U. Einmahl and D. M. Mason. Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 2005.
[7] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21:32-40, 1975.
[8] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. 2012.
[9] E. Giné and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l'Institut Henri Poincaré (B) Probability and Statistics, 2002.
[10] M. A. Guest. Morse theory in the 1990's. arXiv:math/0104155v1, 2001.
[11] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge, second edition, 2013.
[12] J. Li, S. Ray, and B. Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8):1687-1723, 2007.
[13] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[14] E. Mammen, J. Marron, and N. Fisher. Some asymptotics for multimodality tests based on kernel density estimates. Probability Theory and Related Fields, 1992.
[15] U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 2011.
[16] J. P. Romano. On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics, 1988.
[17] T. Sousbie. The persistent cosmic web and its filamentary structure - I. Theory and implementation. Monthly Notices of the Royal Astronomical Society, 2011.
