Local depth-based classification of directional data



Giuseppe Gismondi*, Rebecca Rivieccio†, and Giuseppe Pandolfo*

* Dept. of Economics and Statistics, University of Naples Federico II, Napoli, Italy
† Dept. of Physics, University of Naples Federico II, Napoli, Italy

Abstract

Directional data arise in many applications where observations are naturally represented as unit vectors, or as observations on the surface of a unit hypersphere. In this context, statistical depth functions provide a center–outward ordering of the data. This work proposes the use of a local notion of data depth within the DD-plot (Depth vs. Depth plot) to classify directional data. The proposed method is investigated through an extensive simulation study and two real-data examples.

1 Introduction

Directional data analysis is a branch of statistics concerned with the exploration and modelling of data expressed as angles or unit vectors. Such data lie on the surface of the unit hypersphere $S^{q-1} := \{x \in \mathbb{R}^q : \|x\|_2 = 1\}$ of $q-1$ dimensions, where $\|x\|_2 := \sqrt{\sum_{i=1}^q x_i^2}$ and $x = (x_1, \ldots, x_q)$. They naturally arise in a variety of real-world contexts where the vectors represent directions, rotations, or cyclic phenomena. Prominent applications can be found in geology, where the orientation of magnetic fields in rocks is studied, as well as in meteorology and psychology, in the analysis of wind directions or of the perception of spatial orientation. Further examples and theoretical developments are discussed by Mardia and Jupp (1999), which remains a fundamental reference in the field of directional statistics, later complemented by the more recent work of Ley and Verdebout (2017). There are several other application domains in which the orientation of vectors in space carries richer information than their magnitude, such as compositional data (i.e., vectors of nonnegative components that sum to one), for instance the relative frequencies of words in a document (see Pandolfo and D'Ambrosio, 2021). As noted by Stephens (1982), applying a square-root transformation to each vector maps these compositions to directional data lying on the surface of a $(q-1)$-dimensional unit hypersphere.

When dealing with this type of data, several challenges arise due to the absence of a natural reference direction and the lack of a unique definition of orientation or sense of rotation. Moreover, since directional data do not have a natural ordering, the development of suitable depth functions can be quite useful. Indeed, depth functions provide a notion of centrality, allowing a center–outward ordering of locations on the manifold (Agostinelli and Romanazzi, 2013) by generalizing the univariate notions of median and rank to the multivariate setting. Several notions of depth for directional data have been proposed and employed as feature spaces for implementing supervised classification methods (Pandolfo and D'Ambrosio, 2021; Dey and Jana, 2025). Traditional global angular depth functions aim to describe the overall centrality of a point with respect to the entire data distribution, providing a single measure of how central or peripheral an observation is. However, this approach is reliable only when dealing with unimodal and convexly distributed data.
In the case of multimodality or non-convex data structures, typically arising in mixture models or clustering problems, global depths fail to provide a meaningful representation of centrality, since multiple local centers may exist (Paindaveine and Van Bever, 2013). To overcome this limitation, local depth functions have been proposed to provide a more refined assessment of centrality. Specifically, they aim to evaluate the position of a point within a restricted neighbourhood of the data, so that centrality can be captured at a specific scale of locality (Agostinelli and Romanazzi, 2011). Such an approach allows for a more flexible characterization of complex data structures, making local depths particularly effective also for classification tasks. Hence, the goal of this work is to define a local version of the cosine distance depth (CDD) proposed by Pandolfo et al. (2018), to be exploited for directional data classification purposes. More specifically, we consider its application within the two-step procedure known as the DD-plot (Depth vs. Depth plot), introduced by Liu et al. (1999) and later used to perform classification by Li et al. (2012) (the DD-classifier). Roughly speaking, for two given samples, the corresponding DD-plot represents the depth values of the sample points with respect to the two underlying distributions, and thus transforms samples in any dimension into a simple two-dimensional scatter plot. A curve that best separates the two samples in the DD-plot is then sought, in the sense that the separation yields the smallest classification error in the DD space.

The remainder of the paper is organized as follows. Sections 2 and 3 briefly recall the concept of data depth for directional data and classification via the Depth vs. Depth (DD)-plot. Section 4 introduces a notion of local cosine distance depth, which is then used to build the classifier introduced in Section 5. Section 6 presents an extensive simulation study investigating the performance of the proposed method. Section 7 provides two real-data examples. Finally, some concluding remarks are offered in Section 8.

2 Data depths for directional data

Statistical depth functions extend univariate ordering to higher dimensions. In particular, they offer a center–outward ordering by providing a measure of how central a point is with respect to a certain distribution. The concept of data depth was first extended to the analysis of directional data by Small (1987) and later by Liu and Singh (1992). Accordingly, directional depth functions measure the degree of centrality of a point with respect to a directional distribution, and they provide a center–outward ordering on circles or on hyperspheres. Within the literature we can find the angular Tukey depth (ATD) and the angular simplicial depth (ASD), which are the directional extensions of Tukey's halfspace depth (Tukey, 1975) and of the simplicial depth originally introduced for data in $\mathbb{R}^q$, respectively. Such depths are also known as geometric depths because they are based on geometric structures (hemispheres and simplices). Their main drawback is consequently their high computational cost, which makes them unfeasible when $q > 3$. Because of this computational issue, here we focus on the class of distance-based depth functions introduced by Pandolfo et al. (2018).
This class includes the arc distance depth (ADD) of Liu and Singh (1992), the chord distance depth (ChDD), and the cosine distance depth (CDD). Such distance-based depths are computationally feasible even in high dimensions and are strictly positive everywhere on $S^{q-1}$, except in the uninteresting case of a point mass distribution, whereas ASD and ATD may take zero values, which can cause issues in supervised classification. In addition, they do not produce ties in the sample case. The computational advantage of CDD stems from the fact that it requires only pairwise inner products $\langle x_i, x_j \rangle$, which can be computed efficiently even in high dimensions. This contrasts with geometric depths, which require solving complex optimization problems over hemispheres or simplices. One more notion of depth for directional data is the angular Mahalanobis depth, studied by Ley et al. (2014) and developed using the concept of directional quantiles. However, its application is often limited by the necessary prior choice of a spherical location functional.

Here we focus on the CDD because of its computational ease and its properties, which are particularly useful in defining our proposal. In the following, we recall its definition.

Definition 1 (Cosine Distance Depth). The cosine distance depth of a point $x \in S^{q-1}$ with respect to the distribution $F$ on $S^{q-1}$ is defined as
$$\mathrm{CDD}(x, F) := 2 - \mathbb{E}_F[d_{\cos}(x, W)],$$
where $d_{\cos}(x, w) = 1 - \langle x, w \rangle$ is the cosine distance, and $W \sim F$.

The sample version is obtained by replacing $F$ with its empirical counterpart $\hat{F}_n$ calculated from the sample $x_1, \ldots, x_n$. The CDD satisfies all the following properties:

P1. Rotation invariance: $\mathrm{CDD}(x, F) = \mathrm{CDD}(Ox, OF)$ for any $q \times q$ orthogonal matrix $O$.

P2. Maximality at center: $\max_{x \in S^{q-1}} \mathrm{CDD}(x, F) = \mathrm{CDD}(x_0, F)$ for any $F$ with center at $x_0$.

P3. Monotonicity on rays from the deepest point: $\mathrm{CDD}(\cdot, \cdot)$ decreases along any geodesic path $t \mapsto x_t$ from the deepest point $x_0$ to its antipodal point $-x_0$.

P4. Minimality at the point antipodal to the center: $\mathrm{CDD}(-x_0, F) = \inf_{x \in S^{q-1}} \mathrm{CDD}(x, F)$ for any $F$ with center at $x_0$.

While CDD provides a robust global measure of centrality, it may not adequately capture local structure in complex data distributions. In Section 4 we address this issue by introducing a local version of CDD that adapts to the scale of locality.
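To make the sample version concrete, the following minimal R sketch (the paper's experiments are implemented in R) computes the empirical CDD of a point with respect to a sample of unit vectors; the function name cdd and the toy data are illustrative, not part of any package.

```r
# Sample cosine distance depth: CDD(x, F_n) = 2 - mean of d_cos(x, x_k),
# where d_cos(x, w) = 1 - <x, w> and the rows of X are unit vectors.
# (When x belongs to X, the zero self-distance is included in the average.)
cdd <- function(x, X) {
  2 - mean(1 - drop(X %*% x))
}

# Toy usage: points roughly uniform on S^2.
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
X <- X / sqrt(rowSums(X^2))        # project onto the unit sphere
depths <- apply(X, 1, cdd, X = X)  # depth of each sample point
```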
3 Classification in the depth space

After the first suggestion in Liu and Singh (1992), the use of data depth to perform supervised classification has been investigated by many authors. Two main approaches have been proposed in the literature: (i) the maximum depth classifier and (ii) the Depth vs. Depth (DD)-classifier. The first simply assigns a point $x$ to the distribution (or group) with respect to which it attains the highest depth value, for any considered depth $D(\cdot, \cdot)$, that is,
$$D(x, \hat{F}_i) > D(x, \hat{F}_j),\ i \neq j \ \Rightarrow\ \text{assign } x \text{ to } \hat{F}_i,$$
where $D(x, \hat{F}_i)$ and $D(x, \hat{F}_j)$ are the empirical depths of $x$ with respect to the $i$-th and $j$-th distributions, respectively. The second approach was proposed by Li et al. (2012) and is a refinement of the maximum depth classifier. It is based on the DD-plot (Depth vs. Depth plot), introduced by Liu et al. (1999), which is a two-dimensional scatterplot where each data point is represented with coordinates given by its depth evaluated with respect to two distributions. A classification rule $s(\cdot)$ is then applied directly in the DD-space:
$$C_s(x) = \begin{cases} \text{assign } x \text{ to } \hat{F}_i, & \text{if } D(x, \hat{F}_i) > s\big(D(x, \hat{F}_j)\big), \\ \text{assign } x \text{ to } \hat{F}_j, & \text{if } D(x, \hat{F}_i) \le s\big(D(x, \hat{F}_j)\big), \end{cases}$$
where $s(\cdot)$ is a real increasing function. Li et al. (2012) suggested looking for a polynomial separator chosen to minimize the empirical misclassification error rate on the training sample. Note that when $s(\cdot)$ is the identity function, the classification rule becomes the maximum depth classifier.

The maximum depth classifier is certainly quite intuitive and easy to implement. In addition, it can deal with classification problems involving a large number of groups. Conversely, the DD-classifier is more flexible, but it requires searching for the degree of the polynomial function for which the misclassification rate is minimized. The procedure can be applied to any kind of data, provided that a corresponding depth function exists. For instance, DD-plots have been developed for functional data (Cuesta-Albertos et al., 2017) and also for directional data by means of standard global depths (Pandolfo and D'Ambrosio, 2021; Demni et al., 2019).

Obviously, an important aspect to be considered is the choice of the depth function. Indeed, from a classification perspective, it must be noted that the halfspace and simplicial depths assign zero depth to all points which do not belong to the convex hull of the support of the distribution. This implies that sample points lying outside the convex hull of the training set have zero empirical depth, and thus cannot be assigned to one of the competing groups. On the contrary, this does not occur when adopting distance-based depths, since they are always positive for any data point in the sample space. While existing works have applied DD-classifiers with global depth functions, this approach may not capture local structure in complex directional distributions. In the next sections, we introduce a local version of the cosine distance depth and develop a corresponding DD-classifier that adapts to the scale of locality.

4 Local Cosine Distance Depth

To develop a notion of depth capable of describing local features and mode(s) in directional distributions, Agostinelli and Romanazzi (2012) proposed a local version of the ASD by constraining the size of the spherical simplices, while later Pandolfo (2022) proposed a local extension of the distance-based depths, restricting a global distance-based depth measure to the points within a certain distance of a given point. Drawing inspiration from the work of Paindaveine and Van Bever (2013), we propose a local version of the cosine distance depth. It is obtained by computing the depth with respect to the empirical distribution associated with the sample obtained by adding to the original observations their reflections with respect to a given point $x$. We provide the definition of this depth below and briefly discuss why this approach can be problematic, or even fail altogether, with other measures of angular depth. To do so, we first need to define a symmetric reflection operator on the unit hypersphere.
Given a point $x_i \in S^{q-1}$, another point $x_j \in S^{q-1}$ can be reflected symmetrically through $x_i$ as follows:
$$R(x_j, x_i) = 2 x_i \langle x_i, x_j \rangle - x_j,$$
where $\langle \cdot, \cdot \rangle$ is the scalar product between the two vectors. This operator has the following properties:

1. $R(-x_i, x_i) = -x_i$;
2. $R(x_j, x_i) = R(x_j, -x_i)$;
3. given a distance function $d(\cdot, \cdot)$ defined on the unit hypersphere, $d(x_i, R(x_j, x_i)) = d(x_i, x_j)$.

Consider a sample $X$ on $S^{q-1}$ and any given point $x_i \in X$. Let $X_{-i} := X \setminus \{x_i\}$ denote the sample without $x_i$. The set of all reflected points is denoted by $R_i := \{R(x_j, x_i) : x_j \in X_{-i}\}$, and the augmented sample is $X^{R_i} := X \cup R_i$. Here we assume that $|X| = n$, so $|X_{-i}| = n - 1$ and $|X^{R_i}| = 2n - 1$.

One would expect $x_i$ to be the depth median of its own reflected region; however, this is not always true. When considering CDD, it is possible to identify a condition that precisely characterizes when this occurs.

Proposition 1. Given a point $x_i \in X \subseteq S^{q-1}$ and the reflected region $X^{R_i}$, we have:
$$x_i = \operatorname*{argmax}_{x_j \in X} \mathrm{CDD}\big(x_j, X^{R_i}\big) \quad \text{if } 1 + 2\sum_{k \neq i} \langle x_i, x_k \rangle > 0,$$
$$\operatorname*{argmax}_{x_k \in X_{-i}} d_{\cos}(x_i, x_k) = \operatorname*{argmax}_{x_j \in X} \mathrm{CDD}\big(x_j, X^{R_i}\big) \quad \text{if } 1 + 2\sum_{k \neq i} \langle x_i, x_k \rangle < 0,$$
$$X = \operatorname*{argmax}_{x_j \in X} \mathrm{CDD}\big(x_j, X^{R_i}\big) \quad \text{if } 1 + 2\sum_{k \neq i} \langle x_i, x_k \rangle = 0,$$
where $d_{\cos}(x, y) = 1 - \langle x, y \rangle$. Hence, $x_i$ is either a depth median or antipodal to the depth median of the region.

Proof. Let $X = \{x_1, \ldots, x_n\} \subset S^{q-1}$. For a fixed $x_i$, define the set of reflections through $x_i$:
$$R_i = \big\{x_k^{R_i} : x_k^{R_i} = 2 x_i \langle x_i, x_k \rangle - x_k,\ k \neq i\big\}.$$
The CDD of a point $x_j$ with respect to the augmented sample $X^{R_i} = X \cup R_i$ is
$$\mathrm{CDD}\big(x_j, X^{R_i}\big) = 2 - \frac{1}{2(n-1)} \Big[ \sum_{k \neq j} d_{\cos}(x_j, x_k) + \sum_{k \neq i} d_{\cos}\big(x_j, x_k^{R_i}\big) \Big].$$
Maximizing CDD is equivalent to minimizing
$$f(x_j) = \sum_{k \neq j} d_{\cos}(x_j, x_k) + \sum_{k \neq i} d_{\cos}\big(x_j, x_k^{R_i}\big).$$
For the first sum, since $d_{\cos}(x_j, x_j) = 0$,
$$\sum_{k \neq j} d_{\cos}(x_j, x_k) = \sum_{k=1}^n \big[1 - \langle x_j, x_k \rangle\big] = n - \langle x_j, S \rangle,$$
where $S = \sum_{k=1}^n x_k$. For the second sum, using $\langle x_j, x_k^{R_i} \rangle = 2 \langle x_i, x_k \rangle \langle x_j, x_i \rangle - \langle x_j, x_k \rangle$, we obtain
$$d_{\cos}\big(x_j, x_k^{R_i}\big) = 1 - 2 \langle x_i, x_k \rangle \langle x_j, x_i \rangle + \langle x_j, x_k \rangle.$$
Summing over $k \neq i$:
$$\sum_{k \neq i} d_{\cos}\big(x_j, x_k^{R_i}\big) = (n-1) - 2 \langle x_j, x_i \rangle \sum_{k \neq i} \langle x_i, x_k \rangle + \Big\langle x_j, \sum_{k \neq i} x_k \Big\rangle.$$
Thus
$$f(x_j) = \Big[n - \langle x_j, S \rangle\Big] + \Big[(n-1) - 2 A \langle x_j, x_i \rangle + \Big\langle x_j, \sum_{k \neq i} x_k \Big\rangle\Big],$$
where $A = \sum_{k \neq i} \langle x_i, x_k \rangle$. Since $S = x_i + \sum_{k \neq i} x_k$, we have
$$-\langle x_j, S \rangle + \Big\langle x_j, \sum_{k \neq i} x_k \Big\rangle = -\langle x_j, x_i \rangle.$$
Therefore
$$f(x_j) = 2n - 1 - \langle x_j, x_i \rangle - 2 A \langle x_j, x_i \rangle.$$
Let $v = \langle x_j, x_i \rangle$ and $g(v) = f(x_j)$. Then $g(v) = 2n - 1 - v(1 + 2A)$. Since $g$ is linear in $v$, and $v \in [-1, 1]$ for unit vectors, the minimizer depends on the sign of $C = 1 + 2A$:

• If $C > 0$: $g$ is decreasing in $v$. The minimum occurs at the largest possible $v$ in $X$, which is $v = 1$, attained at $x_j = x_i$. Hence $x_i$ is the unique minimizer of $f$, i.e., $x_i$ maximizes CDD.

• If $C < 0$: $g$ is increasing in $v$. The minimum occurs at the smallest possible $v$ in $X$. The smallest $v$ is $-1$ if $-x_i \in X$; otherwise it is $\min_{x_j \in X} \langle x_j, x_i \rangle$, which corresponds to $\max_{x_j \in X} d_{\cos}(x_i, x_j)$. Hence the maximizer of $d_{\cos}(x_i, \cdot)$ in $X_{-i}$ maximizes CDD.

• If $C = 0$: $g$ is constant in $v$, so all $x_j \in X$ give the same CDD and every point in $X$ is a maximizer.

In the case $C < 0$, the farthest point from $x_i$ in cosine distance is the closest to the antipode $-x_i$. Thus, $x_i$ is either the depth median of $X^{R_i}$ (when $C > 0$) or antipodal to it (when $C < 0$).
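A small R sketch of the reflection operator and of the sign condition $C = 1 + 2\sum_{k \neq i} \langle x_i, x_k \rangle$ appearing in Proposition 1 (the function names are ours, chosen for illustration):

```r
# Reflection of x_j through x_i on the sphere: R(x_j, x_i) = 2 <x_i, x_j> x_i - x_j.
reflect <- function(xj, xi) 2 * sum(xi * xj) * xi - xj

# Sign condition C = 1 + 2 * sum_{k != i} <x_i, x_k> of Proposition 1.
C_sign <- function(i, X) 1 + 2 * sum(X[-i, , drop = FALSE] %*% X[i, ])

# Property 3: reflection preserves the cosine distance from x_i.
xi <- c(1, 0, 0); xj <- c(0, 1, 0)
r  <- reflect(xj, xi)
all.equal(1 - sum(xi * r), 1 - sum(xi * xj))  # TRUE
```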
This result has an intuitive interpretation: when $x_i$ is centrally located relative to the other points (positive $C$), it becomes the depth median of its reflected region; when $x_i$ is peripheral (negative $C$), its antipode becomes more central in the reflected region. Hence a point $x_i$ may fail to be the depth median of its own reflected region. In particular, when a point $x_i$ is antipodal to the depth median of $X^{R_i}$, the points should be reordered by increasing depth: in such a case, points with lower depth are more similar to $x_i$, according to property P3.

We can now define the depth-based neighbourhood of a point $x_i$ at level $\beta$ using any angular depth measure $\mathrm{AD}(\cdot, \cdot)$.

Definition 2 ($\beta$-depth-based neighbourhood). The $\beta$-depth neighbourhood of a point $x_i \in X \subseteq S^{q-1}$, denoted $DN_i^{(\beta)}$, with $\beta \in (0, 1]$, is defined as the set of the first $\beta(n-1)$ points of $X_{-i}$ reordered in the following way:
$$\mathrm{AD}\big(x_j, X^{R_i}\big) > \mathrm{AD}\big(x_k, X^{R_i}\big) > \cdots > \mathrm{AD}\big(x_n, X^{R_i}\big) \quad \text{if } x_i = \operatorname*{argmax}_{x_j} \mathrm{AD}\big(x_j, X^{R_i}\big);$$
$$\mathrm{AD}\big(x_j, X^{R_i}\big) < \mathrm{AD}\big(x_k, X^{R_i}\big) < \cdots < \mathrm{AD}\big(x_n, X^{R_i}\big) \quad \text{if } x_i = \operatorname*{argmin}_{x_j} \mathrm{AD}\big(x_j, X^{R_i}\big).$$

Proposition 2. Consider a point $x_i \in X \subseteq S^{q-1}$ and the $\beta$-depth-based neighbourhood $DN_i^{(\beta)}$, and let $CN_i$ denote the set of the first $\beta(n-1)$ points of $X_{-i}$ reordered by increasing cosine distance from $x_i$:
$$d_{\cos}(x_j, x_i) < d_{\cos}(x_k, x_i) < \cdots < d_{\cos}(x_n, x_i).$$
Then, if the CDD is adopted as the angular depth measure, we have $DN_i^{(\beta)} = CN_i$.

Proof. From Proposition 1, there are three cases depending on $C = 1 + 2\sum_{k \neq i} \langle x_i, x_k \rangle$.

Case 1: $C > 0$. Here $x_i = \operatorname*{argmax}_{x_k} \mathrm{CDD}(x_k, X^{R_i})$, so by Definition 2 points are ordered by decreasing CDD. We need to show
$$\mathrm{CDD}(x_j, X^{R_i}) > \mathrm{CDD}(x_l, X^{R_i}) \iff d_{\cos}(x_i, x_j) < d_{\cos}(x_i, x_l).$$
From the proof of Proposition 1 we have $f(x_j) = 2n - 1 - \langle x_i, x_j \rangle (1 + 2A)$, where $A = \sum_{k \neq i} \langle x_i, x_k \rangle$. Since maximizing CDD is equivalent to minimizing $f$,
$$\mathrm{CDD}(x_j, X^{R_i}) > \mathrm{CDD}(x_l, X^{R_i}) \iff f(x_j) < f(x_l).$$
Now, $f(x_l) - f(x_j) = -[\langle x_i, x_l \rangle - \langle x_i, x_j \rangle](1 + 2A)$. Since $C = 1 + 2A > 0$ in this case,
$$f(x_l) - f(x_j) > 0 \iff \langle x_i, x_l \rangle - \langle x_i, x_j \rangle < 0 \iff \langle x_i, x_j \rangle > \langle x_i, x_l \rangle.$$
Converting to cosine distances,
$$\langle x_i, x_j \rangle > \langle x_i, x_l \rangle \iff 1 - d_{\cos}(x_i, x_j) > 1 - d_{\cos}(x_i, x_l) \iff d_{\cos}(x_i, x_j) < d_{\cos}(x_i, x_l).$$
Thus, in Case 1, ordering by decreasing CDD is equivalent to ordering by increasing cosine distance from $x_i$.

Case 2: $C < 0$. Here $x_i = \operatorname*{argmin}_{x_k} \mathrm{CDD}(x_k, X^{R_i})$, so by Definition 2 points are ordered by increasing CDD. We need to show
$$\mathrm{CDD}(x_j, X^{R_i}) < \mathrm{CDD}(x_l, X^{R_i}) \iff d_{\cos}(x_i, x_j) < d_{\cos}(x_i, x_l).$$
Since $f(x_j) = 2n - 1 - \langle x_i, x_j \rangle(1 + 2A)$ and $C = 1 + 2A < 0$, $f$ is increasing in $v = \langle x_i, x_j \rangle$. Therefore
$$f(x_j) < f(x_l) \iff \langle x_i, x_j \rangle < \langle x_i, x_l \rangle.$$
Recalling that maximizing CDD is equivalent to minimizing $f$,
$$\mathrm{CDD}(x_j, X^{R_i}) > \mathrm{CDD}(x_l, X^{R_i}) \iff f(x_j) < f(x_l) \iff \langle x_i, x_j \rangle < \langle x_i, x_l \rangle.$$
Taking the contrapositive,
$$\mathrm{CDD}(x_j, X^{R_i}) < \mathrm{CDD}(x_l, X^{R_i}) \iff \langle x_i, x_j \rangle > \langle x_i, x_l \rangle.$$
Converting to cosine distances,
$$\langle x_i, x_j \rangle > \langle x_i, x_l \rangle \iff 1 - d_{\cos}(x_i, x_j) > 1 - d_{\cos}(x_i, x_l) \iff d_{\cos}(x_i, x_j) < d_{\cos}(x_i, x_l).$$
Thus, in Case 2, ordering by increasing CDD is also equivalent to ordering by increasing cosine distance from $x_i$.

Case 3: $C = 0$. Here all points have equal CDD, so any ordering yields the same set $DN_i^{(\beta)}$, which trivially equals $CN_i$.

Therefore, in all cases, $DN_i^{(\beta)} = CN_i$.
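Proposition 2 can also be illustrated numerically. The sketch below (reusing cdd, reflect, and C_sign from the earlier sketches) builds the augmented sample $X^{R_i}$ and checks that ordering $X_{-i}$ by depth, in the direction prescribed by Definition 2, coincides with ordering by cosine distance from $x_i$:

```r
set.seed(2)
X <- matrix(rnorm(20 * 3), ncol = 3); X <- X / sqrt(rowSums(X^2))
i   <- 1
Ri  <- t(apply(X[-i, ], 1, reflect, xi = X[i, ]))  # reflections through x_i
XRi <- rbind(X, Ri)                                # augmented sample, 2n - 1 points

dep  <- apply(X[-i, ], 1, cdd, X = XRi)            # depths of X_{-i} w.r.t. X^{R_i}
dist <- drop(1 - X[-i, ] %*% X[i, ])               # cosine distances from x_i

# Decreasing depth when C > 0, increasing depth when C < 0 (Definition 2).
ord_dep <- if (C_sign(i, X) > 0) order(-dep) else order(dep)
identical(ord_dep, order(dist))                    # TRUE (ties a.s. absent)
```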
Hence, defining the $\beta$-depth neighbourhood of a point $x_i$ is equivalent to looking for the $\beta(n-1)$ nearest neighbours of $x_i$. This result was already encountered by Paindaveine and Van Bever (2013) for depths in $\mathbb{R}^q$ with $q = 1$, while here it holds for $q \geq 1$ as long as the CDD is considered. It allows us to compute the reflected region and the depth of a point by simply looking for the nearest points. The local cosine distance depth can then be defined as follows.

Definition 3 (Local cosine distance depth). The local cosine distance depth of a point $x_i \in X \subseteq S^{q-1}$ at a locality level $\beta \in (0, 1]$ is defined as
$$\mathrm{LCDD}^{(\beta)}(x_i, X) = \mathrm{CDD}\big(x_i, DN_i^{(\beta)}\big).$$
When $\beta$ is set equal to one, the global cosine distance depth is obtained.

It is worth noting that the chord and arc distance depths do not guarantee that a point $x_i$ is either the depth median of $X^{R_i}$ or antipodal to it. This makes reordering the points according to depth in a given region difficult, if possible at all. It occurs because the chord and arc distances lack the linear structure that makes the minimization problem in Proposition 1 tractable: the function analogous to $f(x_j)$ in the proof does not simplify to a linear function of $\langle x_i, x_j \rangle$ for these distances, so no simple condition characterizes when $x_i$ is the depth median of $X^{R_i}$.

Note that since cosine distances and neighbourhoods are rotation invariant, so too are nearest neighbours and the local cosine distance depth, i.e., $\mathrm{LCDD}^{(\beta)}(Ox_i, OX) = \mathrm{LCDD}^{(\beta)}(x_i, X)$. LCDD is also bounded and strictly positive, i.e., $0 < \mathrm{LCDD}^{(\beta)}(x_i, X) \leq 2$ for a non-point-mass $X$, since neighbourhoods are finite and the CDD is strictly positive on $S^{q-1}$. LCDD does not satisfy the monotonicity property (P3). This is expected and desirable: it is designed to capture local centrality and may exhibit multiple local maxima in multimodal distributions. It is, however, monotone in the locality level: as $\beta \to 1$, $\mathrm{LCDD}^{(\beta)}(x) \to \mathrm{CDD}(x)$, recovering the global depth.

Figure 1 depicts the behaviour of the proposed local cosine distance depth compared with its global version in the case of a trimodal spherical distribution, showing their contour plots (for $\beta = 0.25$) alongside the density contours of the data. For the sake of illustration, the plots are two-dimensional and the data are reported in spherical coordinates. As one can see, the CDD fails to capture the three modes, identifying only a single global center. In contrast, the proposed LCDD (with $\beta = 0.25$) clearly identifies all three modes.

Figure 1: Density contour plot of a trimodal distribution on the sphere (a), along with the corresponding contour plots of CDD (b) and of LCDD with $\beta = 0.25$ (c).
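Thanks to Proposition 2, the sample LCDD of Definition 3 reduces to averaging the $\lfloor \beta(n-1) \rfloor$ smallest cosine distances. A minimal R sketch (our own helper, not library code):

```r
# Local cosine distance depth of the i-th sample point at locality level beta:
# LCDD = 2 - average of the floor(beta * (n - 1)) smallest cosine distances.
lcdd <- function(i, X, beta) {
  d <- drop(1 - X[-i, , drop = FALSE] %*% X[i, ])  # cosine distances from x_i
  k <- max(1, floor(beta * (nrow(X) - 1)))         # neighbourhood size
  2 - mean(sort(d)[seq_len(k)])
}

local_depths <- sapply(seq_len(nrow(X)), lcdd, X = X, beta = 0.25)
```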
Theorem 3 (Sample behaviour in $\beta$-neighbourhoods). For $0 < \beta_1 < \beta_2 \leq 1$:
1. $DN_x^{(\beta_1)} \subseteq DN_x^{(\beta_2)}$;
2. $\mathrm{LCDD}^{(\beta_1)}(x, X) \geq \mathrm{LCDD}^{(\beta_2)}(x, X)$.

Proof. By Proposition 2, we have $DN_x^{(\beta)} = CN_x^{(\beta)}$ for all $\beta$. Let $r_{(1)}(x) \leq r_{(2)}(x) \leq \cdots \leq r_{(n-1)}(x)$ denote the ordered cosine distances from $x$ to the points of $X \setminus \{x\}$. Define $k_\beta = \lfloor \beta(n-1) \rfloor$, the integer part of $\beta(n-1)$. Since $\beta_1 < \beta_2$, we have $k_{\beta_1} < k_{\beta_2}$. By definition,
$$CN_x^{(\beta_1)} = \{x_j \in X \setminus \{x\} : r_j(x) \leq r_{(k_{\beta_1})}(x)\} \quad \text{and} \quad CN_x^{(\beta_2)} = \{x_j \in X \setminus \{x\} : r_j(x) \leq r_{(k_{\beta_2})}(x)\}.$$
Since $r_{(k_{\beta_1})}(x) \leq r_{(k_{\beta_2})}(x)$, every point of $CN_x^{(\beta_1)}$ also satisfies $r_j(x) \leq r_{(k_{\beta_2})}(x)$ and hence belongs to $CN_x^{(\beta_2)}$. Therefore,
$$DN_x^{(\beta_1)} = CN_x^{(\beta_1)} \subseteq CN_x^{(\beta_2)} = DN_x^{(\beta_2)}.$$
Recall that the local cosine depth of a point $x$ at level $\beta$ can be written as
$$\mathrm{LCDD}^{(\beta)}(x, X) = 2 - \frac{1}{k_\beta} \sum_{j=1}^{k_\beta} r_{(j)}(x).$$
Let $S_{\beta_1} = \sum_{j=1}^{k_{\beta_1}} r_{(j)}(x)$ and $S_{\beta_2} = \sum_{j=1}^{k_{\beta_2}} r_{(j)}(x)$. Since $k_{\beta_2} > k_{\beta_1}$, we can write $S_{\beta_2} = S_{\beta_1} + \Delta S$, where $\Delta S = \sum_{j=k_{\beta_1}+1}^{k_{\beta_2}} r_{(j)}(x)$. Now consider the average distances $S_{\beta_1}/k_{\beta_1}$ and $S_{\beta_2}/k_{\beta_2} = (S_{\beta_1} + \Delta S)/k_{\beta_2}$. Since the $r_{(j)}(x)$ are non-decreasing, for any $j$ with $k_{\beta_1} < j \leq k_{\beta_2}$ we have $r_{(j)}(x) \geq r_{(k_{\beta_1})}(x)$. Moreover, $r_{(k_{\beta_1})}(x) \geq S_{\beta_1}/k_{\beta_1}$, because the maximum of the first $k_{\beta_1}$ distances is at least their average. Therefore,
$$\Delta S = \sum_{j=k_{\beta_1}+1}^{k_{\beta_2}} r_{(j)}(x) \geq (k_{\beta_2} - k_{\beta_1}) \cdot \frac{S_{\beta_1}}{k_{\beta_1}}.$$
This inequality implies
$$S_{\beta_1} + \Delta S \geq S_{\beta_1} \Big(1 + \frac{k_{\beta_2} - k_{\beta_1}}{k_{\beta_1}}\Big) = S_{\beta_1} \cdot \frac{k_{\beta_2}}{k_{\beta_1}}.$$
Dividing by $k_{\beta_2}$ gives $S_{\beta_2}/k_{\beta_2} \geq S_{\beta_1}/k_{\beta_1}$. Finally, since $\mathrm{LCDD}^{(\beta)}(x, X) = 2 - S_\beta/k_\beta$, we obtain
$$\mathrm{LCDD}^{(\beta_1)}(x, X) = 2 - \frac{S_{\beta_1}}{k_{\beta_1}} \geq 2 - \frac{S_{\beta_2}}{k_{\beta_2}} = \mathrm{LCDD}^{(\beta_2)}(x, X).$$

This theorem establishes two key properties: (i) neighbourhoods expand as $\beta$ increases (nesting property), and (ii) local depth decreases as neighbourhoods expand. This is intuitive: as we consider larger neighbourhoods, the average distance to the neighbours increases, reducing the measure of local centrality.

4.1 Population version and properties

While Definition 3 provides a computationally tractable measure for finite samples, theoretical analysis requires its population counterpart.

Definition 4 (Population local cosine distance depth). Let $F$ be a distribution on $S^{q-1}$ with continuous density $f$ that is bounded away from zero on its support. For $x \in S^{q-1}$ and $\beta \in (0, 1]$, let $\rho^{(\beta)}(x)$ be the unique radius satisfying
$$F\big(\{y \in S^{q-1} : d_{\cos}(x, y) \leq \rho^{(\beta)}(x)\}\big) = \beta.$$
The population local cosine distance depth of $x$ with respect to $F$ at locality level $\beta$ is defined as
$$\mathrm{LCDD}^{(\beta)}(x, F) := 2 - \frac{1}{\beta}\, \mathbb{E}_{Y \sim F}\big[d_{\cos}(x, Y)\, \mathbb{I}\{d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)\}\big].$$

Remark 1. Equivalently, $\mathrm{LCDD}^{(\beta)}(x, F) = \mathrm{CDD}(x, F_{\beta,x})$, where $F_{\beta,x}$ denotes the conditional distribution of $F$ restricted to the geodesic ball of probability mass $\beta$ around $x$. This makes explicit the connection between the local and global depth measures.

Monotonicity in $\beta$ holds at the population level as well, since the conditional expectation of $d_{\cos}$ is non-decreasing in the radius $\rho^{(\beta)}(x)$.
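The sample counterpart of this monotonicity (Theorem 3, part 2) is easy to verify numerically with the lcdd sketch above:

```r
# LCDD is non-increasing in beta (Theorem 3, part 2).
betas <- c(0.05, 0.10, 0.25, 0.50, 1)
vals  <- sapply(betas, function(b) lcdd(1, X, b))
all(diff(vals) <= 1e-12)  # TRUE: larger neighbourhoods give smaller local depth
```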
Proposition 4 (Continuity with respect to $\beta$). Let $F$ be a distribution on $S^{q-1}$ with continuous density $f$ bounded away from zero on its support. For $\beta \in (0, 1)$ and $\Delta\beta > 0$ with $\beta + \Delta\beta \leq 1$,
$$\sup_{x \in S^{q-1}} \big| \mathrm{LCDD}^{(\beta)}(x, F) - \mathrm{LCDD}^{(\beta+\Delta\beta)}(x, F) \big| = O(\Delta\beta).$$

Proof. For fixed $x \in S^{q-1}$, let $\rho^{(\beta)}(x)$ be the unique radius satisfying $F(\{y \in S^{q-1} : d_{\cos}(x, y) \leq \rho^{(\beta)}(x)\}) = \beta$. Since $f$ is continuous and bounded away from zero, the quantile function $\beta \mapsto \rho^{(\beta)}(x)$ is Lipschitz continuous in $\beta$, uniformly in $x$. Define the conditional expectations
$$\mu_\beta(x) = \mathbb{E}\big[d_{\cos}(x, Y) \mid d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)\big], \qquad \mu_{\beta+\Delta\beta}(x) = \mathbb{E}\big[d_{\cos}(x, Y) \mid d_{\cos}(x, Y) \leq \rho^{(\beta+\Delta\beta)}(x)\big].$$
Let $A_x = \{y : \rho^{(\beta)}(x) < d_{\cos}(x, y) \leq \rho^{(\beta+\Delta\beta)}(x)\}$, so that $F(A_x) = \Delta\beta$. Then we can write
$$\mu_{\beta+\Delta\beta}(x) = \frac{\beta}{\beta + \Delta\beta}\, \mu_\beta(x) + \frac{\Delta\beta}{\beta + \Delta\beta}\, \mu_A(x),$$
where $\mu_A(x) = \mathbb{E}[d_{\cos}(x, Y) \mid Y \in A_x]$. Rearranging gives
$$\mu_{\beta+\Delta\beta}(x) - \mu_\beta(x) = \frac{\Delta\beta}{\beta + \Delta\beta}\, \big(\mu_A(x) - \mu_\beta(x)\big).$$
Now, since $\rho^{(\beta)}(x) \leq d_{\cos}(x, y) \leq \rho^{(\beta+\Delta\beta)}(x)$ for $y \in A_x$, we have $\mu_A(x) \in [\rho^{(\beta)}(x), \rho^{(\beta+\Delta\beta)}(x)]$. Also, by definition, $\mu_\beta(x) \leq \rho^{(\beta)}(x)$. Therefore
$$|\mu_A(x) - \mu_\beta(x)| \leq \rho^{(\beta+\Delta\beta)}(x) - \mu_\beta(x) \leq 2,$$
since $d_{\cos} \in [0, 2]$. However, we can obtain a tighter bound using the continuity of $\rho^{(\beta)}(x)$. From the Lipschitz continuity of $\rho^{(\beta)}(x)$ in $\beta$, there exists $L > 0$ such that, for all $x$, $|\rho^{(\beta+\Delta\beta)}(x) - \rho^{(\beta)}(x)| \leq L \Delta\beta$. Thus
$$|\mu_A(x) - \mu_\beta(x)| \leq |\rho^{(\beta+\Delta\beta)}(x) - \rho^{(\beta)}(x)| + |\rho^{(\beta)}(x) - \mu_\beta(x)| \leq L \Delta\beta + |\rho^{(\beta)}(x) - \mu_\beta(x)|.$$
The term $|\rho^{(\beta)}(x) - \mu_\beta(x)|$ is bounded by a constant $M$ uniformly in $x$, because $f$ is bounded away from zero, ensuring that the conditional distribution is not too concentrated at the boundary. Hence $|\mu_A(x) - \mu_\beta(x)| \leq L \Delta\beta + M$. Returning to the difference,
$$|\mu_{\beta+\Delta\beta}(x) - \mu_\beta(x)| = \frac{\Delta\beta}{\beta + \Delta\beta}\, |\mu_A(x) - \mu_\beta(x)| \leq \frac{\Delta\beta}{\beta}\, (L \Delta\beta + M) = O(\Delta\beta).$$
Since $\mathrm{LCDD}^{(\beta)}(x, F) = 2 - \mu_\beta(x)$, we have
$$|\mathrm{LCDD}^{(\beta)}(x, F) - \mathrm{LCDD}^{(\beta+\Delta\beta)}(x, F)| = |\mu_{\beta+\Delta\beta}(x) - \mu_\beta(x)| = O(\Delta\beta).$$
The bound holds uniformly in $x$ because all constants ($L$, $M$, and the implicit constant in $O(\Delta\beta)$) are independent of $x$, due to the compactness of $S^{q-1}$ and the uniform bounds on $f$.

Corollary 1 (Limit behaviour). Let $F$ be a distribution on $S^{q-1}$ with continuous density $f$ bounded away from zero on its support. Then, for all $x \in S^{q-1}$,
$$\lim_{\beta \to 0^+} \mathrm{LCDD}^{(\beta)}(x, F) = 2, \qquad \lim_{\beta \to 1^-} \mathrm{LCDD}^{(\beta)}(x, F) = \mathrm{CDD}(x, F),$$
where $\mathrm{CDD}(x, F) = 2 - \mathbb{E}_F[d_{\cos}(x, Y)]$ is the population cosine distance depth.

Proof. Recall that $\mathrm{LCDD}^{(\beta)}(x, F) = 2 - \mu_\beta(x)$, where $\mu_\beta(x) = \mathbb{E}[d_{\cos}(x, Y) \mid d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)]$ and $\rho^{(\beta)}(x)$ satisfies $F\{d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)\} = \beta$.

(i) As $\beta \to 0^+$, the radius $\rho^{(\beta)}(x) \to 0$ because the density $f$ is bounded away from zero. More precisely, since $f$ is continuous and positive, for small $\beta$ we have $\rho^{(\beta)}(x) = O(\beta)$. For any $Y$ with $d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)$, we have $0 \leq d_{\cos}(x, Y) \leq \rho^{(\beta)}(x) \to 0$. By dominated convergence (since $d_{\cos} \leq 2$), we obtain
$$\lim_{\beta \to 0^+} \mu_\beta(x) = \lim_{\beta \to 0^+} \mathbb{E}\big[d_{\cos}(x, Y) \mid d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)\big] = 0.$$
Therefore, $\lim_{\beta \to 0^+} \mathrm{LCDD}^{(\beta)}(x, F) = 2 - 0 = 2$.
(ii) As $\beta \to 1^-$, the radius $\rho^{(\beta)}(x)$ increases toward $\rho_{\max}(x) = \inf\{t \geq 0 : F\{d_{\cos}(x, Y) \leq t\} = 1\}$, the smallest radius such that the ball of that radius around $x$ contains the entire support of $F$. The conditional distribution given $d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)$ converges weakly to the unconditional distribution $F$ as $\beta \to 1^-$. By the bounded convergence theorem (since $d_{\cos}$ is bounded),
$$\lim_{\beta \to 1^-} \mu_\beta(x) = \mathbb{E}_F[d_{\cos}(x, Y)].$$
Therefore, $\lim_{\beta \to 1^-} \mathrm{LCDD}^{(\beta)}(x, F) = 2 - \mathbb{E}_F[d_{\cos}(x, Y)] = \mathrm{CDD}(x, F)$, the population version of the cosine distance depth.

These limits have intuitive interpretations: as $\beta \to 0^+$ the neighbourhood shrinks to a point, so the average distance goes to 0 and LCDD approaches its maximum value 2; as $\beta \to 1^-$ the neighbourhood expands to cover the entire distribution, recovering the global CDD.

Remark 2. Under appropriate regularity conditions (continuity of $F$ and boundedness away from zero), the sample version (Definition 3) is a uniformly consistent estimator of the population LCDD: as $n \to \infty$, the empirical $\beta(n-1)$ nearest neighbours converge to the population geodesic ball containing mass $\beta$, and the sample average of cosine distances converges to the population expectation. The following lemma establishes that this convergence is uniform over the hypersphere.

Lemma 1 (Uniform consistency of LCDD). Let $F_n$ denote the empirical measure of a random sample $X_1, \ldots, X_n$ i.i.d. from a distribution $F$ on $S^{q-1}$ with density $f$ that is continuous and bounded away from zero on its support. For any $\beta \in (0, 1]$, we have
$$\sup_{x \in S^{q-1}} \big| \mathrm{LCDD}^{(\beta)}(x, F_n) - \mathrm{LCDD}^{(\beta)}(x, F) \big| \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty,$$
where $\mathrm{LCDD}^{(\beta)}(x, F_n)$ denotes the sample LCDD based on $X_n = \{X_1, \ldots, X_n\}$.

Proof. Let $k_n = \lfloor \beta(n-1) \rfloor$. For $x \in S^{q-1}$, let $r_n^{(1)}(x) \leq r_n^{(2)}(x) \leq \cdots \leq r_n^{(n)}(x)$ denote the ordered cosine distances $\{d_{\cos}(x, X_i)\}_{i=1}^n$. Recall the definitions: $\mathrm{LCDD}^{(\beta)}(x, F) = 2 - \mu_\beta(x)$, where
$$\mu_\beta(x) = \frac{1}{\beta}\, \mathbb{E}_F\big[d_{\cos}(x, Y)\, \mathbb{I}\{d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)\}\big],$$
with $\rho^{(\beta)}(x)$ satisfying $F\{d_{\cos}(x, Y) \leq \rho^{(\beta)}(x)\} = \beta$. The sample LCDD is
$$\mathrm{LCDD}^{(\beta)}(x, F_n) = 2 - \hat{\mu}_n(x), \qquad \hat{\mu}_n(x) = \frac{1}{k_n} \sum_{i=1}^{k_n} r_n^{(i)}(x).$$
We prove uniform convergence in three steps.

Step 1. Define the empirical process
$$F_n(t; x) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{d_{\cos}(x, X_i) \leq t\},$$
and its population counterpart $F(t; x) = P(d_{\cos}(x, Y) \leq t)$. Since the class of sets $\{\{y : d_{\cos}(x, y) \leq t\} : x \in S^{q-1}, t \in [0, 2]\}$ is a VC class (geodesic balls on the sphere), the uniform Glivenko–Cantelli theorem gives
$$\sup_{x \in S^{q-1}} \sup_{t \in [0, 2]} |F_n(t; x) - F(t; x)| \xrightarrow{a.s.} 0.$$
Let $\rho^{(\beta)}(x)$ be the population $\beta$-quantile, $F(\rho^{(\beta)}(x); x) = \beta$, and define the empirical quantile $\hat{\rho}_n(x) = r_n^{(k_n)}(x)$, which satisfies $F_n(\hat{\rho}_n(x); x) = k_n/n \to \beta$. By the uniform continuity of $F(t; x)$ in $t$ (implied by $f$ being bounded away from zero) and the uniform convergence of $F_n$, we obtain
$$\sup_{x \in S^{q-1}} |\hat{\rho}_n(x) - \rho^{(\beta)}(x)| \xrightarrow{a.s.} 0.$$
Step 2. Define the truncated empirical process
$$G_n(t; x) = \frac{1}{n} \sum_{i=1}^n d_{\cos}(x, X_i)\, \mathbb{I}\{d_{\cos}(x, X_i) \leq t\},$$
and its population counterpart $G(t; x) = \mathbb{E}[d_{\cos}(x, Y)\, \mathbb{I}\{d_{\cos}(x, Y) \leq t\}]$. The class of functions
$$\mathcal{F} = \big\{(x, y) \mapsto d_{\cos}(x, y)\, \mathbb{I}\{d_{\cos}(x, y) \leq t\} : x \in S^{q-1},\ t \in [0, 2]\big\}$$
is uniformly bounded (by 2) and, as the product of the Lipschitz function $d_{\cos}(x, \cdot)$ with the indicator of a VC class, is itself a Glivenko–Cantelli class. Therefore,
$$\sup_{x \in S^{q-1}} \sup_{t \in [0, 2]} |G_n(t; x) - G(t; x)| \xrightarrow{a.s.} 0.$$
Now, by the continuous mapping theorem and Step 1, $G_n(\hat{\rho}_n(x); x) \xrightarrow{a.s.} G(\rho^{(\beta)}(x); x) = \beta \mu_\beta(x)$ uniformly in $x$. But also,
$$G_n(\hat{\rho}_n(x); x) = \frac{1}{n} \sum_{i=1}^n d_{\cos}(x, X_i)\, \mathbb{I}\{d_{\cos}(x, X_i) \leq \hat{\rho}_n(x)\} = \frac{k_n}{n}\, \hat{\mu}_n(x).$$
Since $k_n/n \to \beta$ almost surely, we conclude that $\sup_{x \in S^{q-1}} |\hat{\mu}_n(x) - \mu_\beta(x)| \xrightarrow{a.s.} 0$.

Step 3. From Steps 1 and 2,
$$\sup_{x \in S^{q-1}} |\mathrm{LCDD}^{(\beta)}(x, F_n) - \mathrm{LCDD}^{(\beta)}(x, F)| = \sup_{x \in S^{q-1}} |\hat{\mu}_n(x) - \mu_\beta(x)| \xrightarrow{a.s.} 0,$$
which establishes the desired uniform almost sure convergence.

Corollary 2. Under the assumptions of Lemma 1, for any $\beta^* \in (0, 1]$,
$$\sup_{x \in S^{q-1}} \big| \mathrm{LCDD}^{(\beta^*)}(x, \hat{F}_n) - \mathrm{LCDD}^{(\beta^*)}(x, F) \big| \xrightarrow{p} 0.$$

Note that the continuity result in Proposition 4 does not hold in the empirical case, where the LCDD is in general a piecewise constant function of $\beta$. Specifically, for a fixed point $x$ and $k \in \{1, 2, \ldots, n-1\}$, $\mathrm{LCDD}^{(\beta)}(x, \hat{F}_n)$ remains constant on each interval $\big[\frac{k}{n-1}, \frac{k+1}{n-1}\big)$, while continuity at $\beta = k/(n-1)$ holds if and only if the cosine distance of the $k$-th nearest point coincides with the average of the previous distances. Since this equality is highly unlikely in non-degenerate samples, the empirical LCDD is typically discontinuous in $\beta$.

5 DD-classifier with local depth

The depth of a given point characterizes its location with respect to the whole distribution. Thus, classifiers based on a global depth function perform well only if the considered distributions enjoy some global properties such as symmetry or unimodality. To obtain good performance in more general settings, some local depth should be preferred; the issue that then arises, and that needs to be handled, is the choice of the localization level. The first classifier employing local depth was proposed by Hlubinka and Vencalek (2013), who used a weighted halfspace depth. Later, Paindaveine and Van Bever (2013) developed the more sophisticated approach from which we draw inspiration: it enables the localization of any global depth function, which is then used in the maximum depth classifier. Here, instead, we apply the proposed local cosine distance depth in the DD-plot, where a polynomial separating function is then adopted to discriminate between groups.

We focus on the two-class classification problem. Let $\{X_1, \ldots, X_m\}$ ($\equiv X$) and $\{Y_1, \ldots, Y_n\}$ ($\equiv Y$) be two random samples from $F_1$ and $F_2$, respectively, which are distributions defined on $S^{q-1}$. As seen in Pandolfo (2022) and from the definition of the DD-plot, if $F_1 = F_2$ then the DD-plot should be concentrated along the 45-degree line. Conversely, if the two distributions differ, the DD-plot will exhibit a departure from the 45-degree line.
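Before stating the classification rule formally, the following hedged R sketch shows how the DD-plot coordinates can be constructed. The helper lcdd_wrt evaluates the LCDD of an arbitrary point with respect to a reference sample; using $\lfloor \beta\, n \rfloor$ as the neighbourhood size for an external point is our own convention, not a prescription of the paper.

```r
# LCDD of an arbitrary point z with respect to a reference sample S.
lcdd_wrt <- function(z, S, beta) {
  d <- drop(1 - S %*% z)              # cosine distances from z to the sample
  k <- max(1, floor(beta * nrow(S)))  # neighbourhood size (our convention)
  2 - mean(sort(d)[seq_len(k)])
}

# DD-plot coordinates of the points in Z w.r.t. training samples X1 and X2.
dd_coords <- function(Z, X1, X2, beta) {
  cbind(D1 = apply(Z, 1, lcdd_wrt, S = X1, beta = beta),
        D2 = apply(Z, 1, lcdd_wrt, S = X2, beta = beta))
}
# plot(dd_coords(Z, X1, X2, 0.1)); abline(0, 1)  # 45-degree reference line
```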
Hence, given a locality level $\beta$, the proposed classifier is defined as
$$C_{\beta,s}(x) = \begin{cases} 2, & \text{if } \mathrm{LCDD}^{(\beta)}(x, \hat{F}_2) \geq s\big(\mathrm{LCDD}^{(\beta)}(x, \hat{F}_1)\big), \\ 1, & \text{otherwise.} \end{cases}$$
For any given $s(\cdot)$, we draw the curve $y = s(x)$ in the DD-plot, assign the observations above the curve to $F_1$ and those below it to $F_2$, and then calculate the empirical misclassification rate, that is,
$$\hat{R}_s = \frac{\pi_1}{m} \sum_{i=1}^m \mathbb{I}\big\{\mathrm{LCDD}^{(\beta)}(X_i, \hat{F}_1) \leq s\big(\mathrm{LCDD}^{(\beta)}(X_i, \hat{F}_2)\big)\big\} + \frac{\pi_2}{n} \sum_{i=1}^n \mathbb{I}\big\{\mathrm{LCDD}^{(\beta)}(Y_i, \hat{F}_1) > s\big(\mathrm{LCDD}^{(\beta)}(Y_i, \hat{F}_2)\big)\big\}, \tag{1}$$
where the $\pi_i$ are the prior probabilities of the two classes, $N = (m, n)$ collects the two sample sizes, and $\mathbb{I}_A$ is the indicator function, which takes the value 1 if $A$ is true and 0 otherwise. Hence, $s(\cdot)$ is estimated by minimizing $\hat{R}_s$. Following Li et al. (2012), here we consider
$$s(x) = \sum_{i=1}^{k_0} a_i x^i,$$
where $k_0$ is the given degree of the polynomial and $a = (a_1, \ldots, a_{k_0}) \in \mathbb{R}^{k_0}$ is its coefficient vector.
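A minimal R sketch of the fitting step: since the empirical risk (1) is piecewise constant in the coefficients, a derivative-free search such as Nelder–Mead (possibly with several restarts, in the spirit of Li et al., 2012) is a natural choice. Equal priors and our own helper names are assumed here.

```r
# Fit the polynomial separator s(x) = a_1 x + ... + a_{k0} x^{k0} by direct
# minimisation of the empirical misclassification rate in the DD space.
fit_separator <- function(D, y, k0 = 2) {
  # D: n x 2 matrix of depths (D1, D2); y: class labels in {1, 2}.
  risk <- function(a) {
    s    <- drop(outer(D[, 1], seq_len(k0), "^") %*% a)  # s(D1) for each point
    yhat <- ifelse(D[, 2] >= s, 2, 1)                    # classifier C_{beta,s}
    mean(yhat != y)                                      # 0-1 empirical risk
  }
  optim(rep(1, k0), risk, method = "Nelder-Mead")$par
}
```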
Theorem 5 (Bayes consistency of the LCDD-DD classifier). Let $F_1, F_2$ be distributions on $S^{q-1}$ with continuous densities $f_1, f_2$ bounded away from zero, and let $\pi_1, \pi_2 > 0$ be class priors with $\pi_1 + \pi_2 = 1$. Let $\mathcal{R}$ be a compact class of functions (pointwise compact) consisting of polynomials of degree at most $k_0$. Define the empirical LCDD depths $D_{\hat{F}_j}^{(\beta)}(z) := \mathrm{LCDD}^{(\beta)}(z, \hat{F}_j)$ for $j = 1, 2$, where $\hat{F}_1, \hat{F}_2$ are the empirical distributions from samples of sizes $n_1, n_2$, respectively, with total sample size $N = n_1 + n_2$. Consider the classifier
$$C_{\beta,s}(z) = \begin{cases} 2, & \text{if } D_{\hat{F}_2}^{(\beta)}(z) \geq s\big(D_{\hat{F}_1}^{(\beta)}(z)\big), \\ 1, & \text{otherwise.} \end{cases}$$
Assume the following conditions hold:

(A1) There exist a unique $\beta^* \in (0, 1]$ and $s_B \in \mathcal{R}$ such that the classifier $C_{\beta^*, s_B}$ equals the Bayes classifier almost everywhere, i.e., $C_{\beta^*, s_B}(z) = \mathbb{I}\{\pi_2 f_2(z) > \pi_1 f_1(z)\}$ a.e.

(A2) For $j = 1, 2$ and any $\beta \in (0, 1]$, $\sup_{z \in S^{q-1}} |D_{\hat{F}_j}^{(\beta)}(z) - D_{F_j}^{(\beta)}(z)| \xrightarrow{p} 0$ as $n_j \to \infty$.

(A3) $\mathcal{R}$ is compact in the topology of pointwise convergence.

(A4) For each $N$, let $B_N \subset (0, 1]$ be a finite grid whose mesh tends to zero as $N \to \infty$, so that $B_N$ becomes dense in $(0, 1]$. Define $\hat{s}_{N,\beta} = \operatorname*{argmin}_{s \in \mathcal{R}} \tilde{R}_N(s, \beta)$, where $\tilde{R}_N$ is the empirical risk, and select $\hat{\beta}_N = \operatorname*{argmin}_{\beta \in B_N} \mathrm{CV}(\hat{s}_{N,\beta}, \beta)$, where CV denotes the cross-validated error. Finally, set $\hat{s}_N = \hat{s}_{N, \hat{\beta}_N}$ and $\hat{C}_N = C_{\hat{\beta}_N, \hat{s}_N}$.

Then, as $N \to \infty$ with $n_j / N \to \lambda_j \in (0, 1)$ for $j = 1, 2$, we have: (i) $\hat{\beta}_N \xrightarrow{p} \beta^*$; (ii) $\hat{s}_N \xrightarrow{p} s_B$ pointwise; (iii) $R(\hat{C}_N) \xrightarrow{p} R_{\mathrm{Bayes}} = R(C_{\beta^*, s_B})$.

Assumption (A1) requires that there exists some locality level $\beta^*$ for which the LCDD-based classifier achieves the Bayes error. This is a reasonable assumption when the data have local structure that can be captured at an appropriate scale. In practice, even if the exact Bayes classifier is not achievable with polynomial separators in the LCDD space, the theorem guarantees that our data-driven procedure will approach the best possible performance within the class $\mathcal{R}$.

Proof. We adapt the framework of Li et al. (2012) to our LCDD-based classifier with data-driven selection of the locality parameter $\beta$.

Step 1. By Proposition 4, for any $\beta \in (0, 1)$ and $\Delta\beta > 0$ with $\beta + \Delta\beta \leq 1$,
$$\sup_{x \in S^{q-1}} \big| \mathrm{LCDD}^{(\beta)}(x, F_j) - \mathrm{LCDD}^{(\beta+\Delta\beta)}(x, F_j) \big| = O(\Delta\beta), \quad j = 1, 2,$$
uniformly in $x$. In particular, for each fixed $z$, the map $\beta \mapsto D_{F_j}^{(\beta)}(z)$ is continuous on $(0, 1]$. The misclassification error of the classifier $C_{\beta,s}$ is
$$R(\beta, s) = \pi_1 P_{F_1}(C_{\beta,s}(Z) = 2) + \pi_2 P_{F_2}(C_{\beta,s}(Z) = 1).$$
Since the classifier depends on $z$ only through $(D_{F_1}^{(\beta)}(z), D_{F_2}^{(\beta)}(z))$ and the indicator of the decision region, the dominated convergence theorem implies that, for each fixed $s \in \mathcal{R}$, the function $\beta \mapsto R(\beta, s)$ is continuous on $(0, 1]$.

Step 2. Fix $\beta \in (0, 1]$ and $j \in \{1, 2\}$. By Lemma 1 (which implies assumption (A2)),
$$\sup_{z \in S^{q-1}} \big| D_{\hat{F}_j}^{(\beta)}(z) - D_{F_j}^{(\beta)}(z) \big| \xrightarrow{a.s.} 0.$$
Since $B_N$ is finite for each $N$, a union bound yields
$$\max_{\beta \in B_N} \sup_{z \in S^{q-1}} \big| D_{\hat{F}_j}^{(\beta)}(z) - D_{F_j}^{(\beta)}(z) \big| \xrightarrow{a.s.} 0 \quad \text{as } n_j \to \infty.$$
Consider the class of decision sets in the depth space,
$$\mathcal{D} = \big\{ \{(u, v) \in [0, 2]^2 : v \leq s(u)\} : s \in \mathcal{R} \big\}.$$
Since $\mathcal{R}$ consists of polynomials of bounded degree, $\mathcal{D}$ has finite VC dimension (Li et al., 2012, Theorem 4). Therefore $\mathcal{D}$ is a Glivenko–Cantelli class, which implies
$$\sup_{\beta \in B_N} \sup_{s \in \mathcal{R}} \big| \hat{R}_N(s, \beta) - R(\beta, s) \big| \xrightarrow{p} 0,$$
where $\hat{R}_N(s, \beta)$ is the empirical risk of $C_{\beta,s}$.

Step 3. Define the optimal risk function $R^*(\beta) = \inf_{s \in \mathcal{R}} R(\beta, s)$, $\beta \in (0, 1]$. From Step 1, $R(\beta, s)$ is continuous in $\beta$ for each $s$ and, by compactness of $\mathcal{R}$, the infimum is attained and $R^*(\beta)$ is continuous. Assumption (A1) implies that $\beta^*$ is the unique minimizer of $R^*(\beta)$ over $(0, 1]$, with $R^*(\beta^*) = R_{\mathrm{Bayes}}$. For each $\beta \in B_N$, $\hat{s}_{N,\beta}$ minimizes the empirical risk. By the uniform convergence in Step 2 and standard M-estimation theory,
$$\sup_{\beta \in B_N} \big| R(\beta, \hat{s}_{N,\beta}) - R^*(\beta) \big| \xrightarrow{p} 0.$$
Since $R^*$ is continuous with unique minimizer $\beta^*$, and $B_N$ becomes dense in $(0, 1]$ as $N \to \infty$ by (A4), we obtain $\hat{\beta}_N = \operatorname*{argmin}_{\beta \in B_N} \mathrm{CV}(\hat{s}_{N,\beta}, \beta) \xrightarrow{p} \beta^*$.

Step 4. For fixed $\beta = \beta^*$, the empirical risk minimizer satisfies $\hat{s}_{N,\beta^*} \xrightarrow{p} s_B$ by standard consistency results for classification with VC classes. The convergence $\hat{\beta}_N \xrightarrow{p} \beta^*$ and the continuity of the risk function imply $\hat{s}_N = \hat{s}_{N, \hat{\beta}_N} \xrightarrow{p} s_B$. Finally, for the risk convergence,
$$|R(\hat{C}_N) - R_{\mathrm{Bayes}}| \leq |R(\hat{\beta}_N, \hat{s}_N) - R(\hat{\beta}_N, s_B)| + |R(\hat{\beta}_N, s_B) - R(\beta^*, s_B)|.$$
The first term converges to 0 because $\hat{s}_N \xrightarrow{p} s_B$ and $s \mapsto R(\beta, s)$ is continuous uniformly in $\beta$ on compacts. The second term converges to 0 because $\hat{\beta}_N \xrightarrow{p} \beta^*$ and $\beta \mapsto R(\beta, s_B)$ is continuous. Hence, $R(\hat{C}_N) \xrightarrow{p} R(\beta^*, s_B) = R_{\mathrm{Bayes}}$.

In the following sections, we investigate the practical performance of the LCDD-DD classifier through simulations (Section 6) and real-data applications (Section 7), comparing it with global depth-based classifiers and other directional classification methods.

6 Simulations

Among the various applications of data depth, supervised classification is the most prominent, particularly in the context of directional data, where the absence of a natural ordering poses specific challenges. In this study, the proposed local depth function is evaluated against its global counterpart through a simulation experiment in which both serve as the underlying measures for DD-classifier training.
This section presents the simulation study designed to compare the performance of the global and local CDD when incorporated into the DD-classifier. Two distinct simulation scenarios are considered, each described in detail in the following subsections and including three experimental setups.

6.1 The simulation design

The first simulation scenario aims to assess the potential classification improvement achieved by the local depth over its global counterpart when observations belonging to the same class are distributed across multiple clusters. The second simulation scenario investigates a different type of data distribution, characterized by a non-convex structure combined with strongly pronounced multimodality.

For each setup, we generated 100 datasets of size $n = 500$. The neighbourhood parameter, expressed as the proportion $\beta$ of nearest units within each class, takes values $\beta \in \{0.05, 0.10, 0.25\}$, while the data dimensionality varies across $d \in \{3, 10, 25\}$. In accordance with Guyon (1997), 70% of the observations were allocated to the training set and the remaining 30% to the test set.

In this study we focus on the binary classification setting with the DD-classifier. Let $W_{1i}$, $i = 1, \ldots, n_1$, and $W_{2i}$, $i = 1, \ldots, n_2$, denote independent random samples drawn from the distributions $F_1$ and $F_2$ on $S^{q-1}$, respectively, where $n_1$ and $n_2$ are the cardinalities of the two classes. The proportion between the two classes was randomly chosen to lie between 35% and 50%. The evaluation metric considered for each simulated dataset is the misclassification rate (MR), as defined in eq. (1). The simulation was fully implemented in R, using the ddalpha package (Pokotylo et al., 2019), which allows the DD-classifier training procedure to be customized so as to incorporate the proposed depth function and to automatically select the polynomial degree $p$ through a cross-validation scheme performed in the depth space.

6.1.1 Scenario 1

The data of the first simulated scenario were generated according to the von Mises–Fisher (vMF) distribution, which plays for data on the unit hypersphere $S^{q-1}$ the same role that the normal distribution does for unconstrained Euclidean data. A $(q-1)$-dimensional unit random vector $x \in S^{q-1}$ is said to follow a vMF distribution if its probability density function is
$$f_q(x \mid \mu, \kappa) = C_q(\kappa) \exp(\kappa \mu' x),$$
where $\|\mu\| = 1$, $\kappa \geq 0$, and $q \geq 2$. The normalizing constant is given by
$$C_q(\kappa) = \frac{\kappa^{q/2 - 1}}{(2\pi)^{q/2} I_{q/2 - 1}(\kappa)},$$
where $I_b$ denotes the modified Bessel function of the first kind and order $b$. The vMF distribution is characterized by the mean direction $\mu$ and the concentration parameter $\kappa$, which controls the degree of dispersion of the observations around the mean vector. In the limiting cases, when $\kappa = 0$ the distribution reduces to the uniform density on $S^{q-1}$, whereas as $\kappa \to \infty$ it degenerates into a point mass at $\mu$.

For the simulation study, three experimental setups were considered. For each setup, the concentration parameter $\kappa$ was randomly drawn to induce low, medium, and high noise levels in the data; specifically, $\kappa_{\mathrm{low}} \sim U[15, 17]$, $\kappa_{\mathrm{medium}} \sim U[10, 12]$, and $\kappa_{\mathrm{high}} \sim U[5, 7]$. Each class was generated as a mixture of vMF distributions, each one indicated as $F_{\mathrm{vMF}(\mu_{w_j}, \kappa)}$.
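The vMF density above can be evaluated directly in R via the base function besselI; a minimal sketch (valid for $\kappa > 0$; sampling from the vMF distribution is a separate matter, handled by dedicated packages):

```r
# von Mises-Fisher density on S^{q-1}: f(x | mu, kappa) = C_q(kappa) exp(kappa mu'x).
dvmf <- function(x, mu, kappa) {
  q  <- length(mu)
  Cq <- kappa^(q / 2 - 1) / ((2 * pi)^(q / 2) * besselI(kappa, nu = q / 2 - 1))
  Cq * exp(kappa * sum(mu * x))
}

dvmf(c(1, 0, 0), mu = c(1, 0, 0), kappa = 10)  # density at the mean direction
```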
The choice of each center $\mu_{w_j}$, where $j$ denotes the component within class $w$, was made randomly but constrained to lie at specific cosine distances $d_{\cos}(\cdot, \cdot)$ from the other centers. Starting from the same initial center $\mu_{1_1} = (\eta_1, \ldots, \eta_{q-1}, \eta_q)$, where $\eta_1 = 1$ and $\eta_t = 0$ for all $t = 2, \ldots, q$, the remaining centers were generated according to the following setups:

• Setup 1: The second class center $\mu_{2_1}$ is randomly generated and constrained to satisfy $d_{\cos}(\mu_{1_1}, \mu_{2_1}) \in [0.3, 0.5]$. Thus, points of each class are drawn from vMF distributions with different mean directions but equal concentration parameter $\kappa$: $F_1 = F_{\mathrm{vMF}(\mu_{1_1}, \kappa)}$ and $F_2 = F_{\mathrm{vMF}(\mu_{2_1}, \kappa)}$.

• Setup 2: The second component of the first class, $\mu_{1_2}$, is randomly generated under the constraint $d_{\cos}(\mu_{1_1}, \mu_{1_2}) \in [0.6, 0.8]$. Then, $\mu_{2_1}$ is generated so that $d_{\cos}(\mu_{1_1}, \mu_{2_1}) = d_{\cos}(\mu_{1_2}, \mu_{2_1}) \in [0.25, 0.45]$. Finally, $\mu_{2_2}$ is generated under the constraints $d_{\cos}(\mu_{2_2}, \mu_{2_1}) \in [d_{\cos}(\mu_{1_1}, \mu_{1_2}) - \epsilon,\ d_{\cos}(\mu_{1_1}, \mu_{1_2}) + \epsilon]$ and $d_{\cos}(\mu_{1_2}, \mu_{2_2}) \in [d_{\cos}(\mu_{1_1}, \mu_{2_1}) - \epsilon,\ d_{\cos}(\mu_{1_1}, \mu_{2_1}) + \epsilon]$, where $\epsilon = 0.1$. In this case, each class is modelled as an equally weighted mixture of two vMF distributions with different mean directions and the same concentration parameter $\kappa$: $F_1 = \frac{1}{2} F_{\mathrm{vMF}(\mu_{1_1}, \kappa)} + \frac{1}{2} F_{\mathrm{vMF}(\mu_{1_2}, \kappa)}$ and $F_2 = \frac{1}{2} F_{\mathrm{vMF}(\mu_{2_1}, \kappa)} + \frac{1}{2} F_{\mathrm{vMF}(\mu_{2_2}, \kappa)}$.

• Setup 3: The second component of the first class, $\mu_{1_2}$, is randomly generated such that $d_{\cos}(\mu_{1_1}, \mu_{1_2}) \in [0.4, 0.6]$. Then, $\mu_{2_1}$ is generated under the constraints $d_{\cos}(\mu_{1_1}, \mu_{2_1}) \in [0.4, 0.6]$ and $d_{\cos}(\mu_{1_2}, \mu_{2_1}) \in [0.8, 1]$. Finally, $\mu_{2_2}$ is generated so that $d_{\cos}(\mu_{1_1}, \mu_{2_2}) \in [0.4, 2]$, $d_{\cos}(\mu_{1_2}, \mu_{2_2}) \in [0.4, 2]$, and $d_{\cos}(\mu_{2_1}, \mu_{2_2}) \in [0.8, 2]$. Similarly to Setup 2, each class is represented by an equally weighted mixture of two vMF components: $F_1 = \frac{1}{2} F_{\mathrm{vMF}(\mu_{1_1}, \kappa)} + \frac{1}{2} F_{\mathrm{vMF}(\mu_{1_2}, \kappa)}$ and $F_2 = \frac{1}{2} F_{\mathrm{vMF}(\mu_{2_1}, \kappa)} + \frac{1}{2} F_{\mathrm{vMF}(\mu_{2_2}, \kappa)}$.

The outcomes of this scenario are displayed in Figure 2, which reports the distributions of the misclassification rates across setups, dimensions, and noise levels. A detailed interpretation of these findings is provided in Section 6.2.

6.1.2 Scenario 2

The Watson distribution was selected as the generating model for the second scenario of the simulation study. Although originally defined for axial data, the Watson distribution can effectively produce non-convex and bipolar structures, making it suitable for assessing the behaviour of local depth functions in complex directional settings. A random unit vector $x \in S^{q-1}$ follows a Watson distribution if its probability density function is given by
$$f(x \mid \mu, \kappa) = M\Big(\tfrac{1}{2}, \tfrac{q}{2}, \kappa\Big)^{-1} \exp\{\kappa (\mu' x)^2\},$$
where $M(1/2, q/2, \cdot)$ denotes Kummer's function. The parameter $\mu$ represents the mean axis, while $\kappa$ determines the axial concentration of the data. For $\kappa > 0$ the distribution is bipolar and, as $\kappa$ increases, it becomes increasingly concentrated around $\pm\mu$. For $\kappa < 0$ it becomes a symmetric girdle distribution, with data concentrated in the subspace orthogonal to $\mu$, and the degree of dispersion is governed by the magnitude of $\kappa$.
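Kummer's function has no base-R implementation, but its defining series $M(a, b, z) = \sum_{n \geq 0} (a)_n z^n / ((b)_n\, n!)$ converges quickly for the moderate $|\kappa|$ used here, so the Watson density can be sketched as follows (a truncated-series approximation, not production code):

```r
# Kummer's confluent hypergeometric function M(a, b, z), truncated series.
kummer_M <- function(a, b, z, tol = 1e-12, nmax = 10000) {
  term <- 1; s <- 1
  for (n in seq_len(nmax)) {
    term <- term * (a + n - 1) / (b + n - 1) * z / n  # ratio of consecutive terms
    s <- s + term
    if (abs(term) < tol * abs(s)) break
  }
  s
}

# Watson density on S^{q-1}: f(x | mu, kappa) = exp(kappa (mu'x)^2) / M(1/2, q/2, kappa).
dwatson <- function(x, mu, kappa) {
  exp(kappa * sum(mu * x)^2) / kummer_M(0.5, length(mu) / 2, kappa)
}
```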
As in the first scenario, three setups were built using the Watson distribution, denoted $F_{\mathrm{Wat}(\mu, \kappa)}$, and random values of $\kappa$ were used to produce low, medium, and high levels of noise in the data; specifically, $\kappa_{\mathrm{low}} \sim U[15, 17]$, $\kappa_{\mathrm{medium}} \sim U[10, 12]$, and $\kappa_{\mathrm{high}} \sim U[5, 7]$. In each setup, the mean axes of the two classes are randomly generated and constrained to have a cosine distance between 0.5 and 0.7, starting from an initial center defined as $\mu_1 = (\eta_1, \ldots, \eta_{q-1}, \eta_q)$, where $\eta_1 = 1$ and $\eta_t = 0$ for all $t = 2, \ldots, q$. To evaluate different data configurations, the sign of the concentration parameter $\kappa$ varies across the setups:

• Setup 1: $\kappa$ is positive for both classes, resulting in two bipolar distributions: $F_1 = F_{\mathrm{Wat}(\mu_1, \kappa)}$ and $F_2 = F_{\mathrm{Wat}(\mu_2, \kappa)}$.

• Setup 2: $\kappa$ is negative for both classes, producing two girdle-shaped distributions: $F_1 = F_{\mathrm{Wat}(\mu_1, -\kappa)}$ and $F_2 = F_{\mathrm{Wat}(\mu_2, -\kappa)}$.

• Setup 3: $\kappa$ is positive for one class and negative for the other, leading to two populations with different shapes: $F_1 = F_{\mathrm{Wat}(\mu_1, \kappa)}$ and $F_2 = F_{\mathrm{Wat}(\mu_2, -\kappa)}$.

The results of this scenario are reported in Figure 3, where the distributions of the misclassification rates are displayed, further conditioned on the setup, data dimension, and noise level. A detailed discussion of these results is provided in Section 6.2.

Figure 2: Simulation results for Scenario 1. The rows of the panel indicate the specific setup, the columns the number of dimensions, and the three colors indicate the noise levels: low (L), green; medium (M), blue; high (H), red.

6.2 Results

Considering the results shown in Figure 2 and Figure 3 for the first and second scenario, respectively, the prediction error consistently increases with both the data dimension and the noise level, regardless of the specific neighbourhood size (including the CDD). In both scenarios there is effectively no difference between the local and global approaches when the dimension is 25 and the noise level is high, as their classification errors become almost indistinguishable. Moreover, within the local framework, the three neighbourhood proportions considered (5%, 10%, and 25%) yield very similar performance.

Focusing on the first scenario, which relies on the vMF distribution: in the first setup, which involves neither multimodality nor non-convexity, all methods exhibit nearly identical performance. Nevertheless, when the dispersion becomes very high, the two classes tend to overlap, making it slightly more advantageous to use larger neighbourhood proportions for the depth computation, although the improvement is modest. As the structure becomes more complex, as in the second setup, the CDD begins to display a higher prediction error than the LCDD, indicating that a local perspective is preferable in this setting, particularly under low noise. For higher noise levels and 10 dimensions or more, however, the performance of all methods becomes essentially indistinguishable. In the third setup of the same scenario, except for the previously discussed behaviour in high dimensions and under high noise, the global approach consistently underperforms the LCDD.
In particular, for 3 and 10 dimensions under low noise, the local depth achieves a misclassification error below 1%, demonstrating excellent performance even in the presence of a highly structured and challenging data configuration. The classification performance of the CDD was previously investigated by Pandolfo and D'Ambrosio (2021), who compared it with other depth measures and showed that it performs very well in several settings involving the vMF distribution. In the context of the second scenario, which is based on the Watson distribution instead, the results are straightforward to interpret: the CDD consistently underperforms the LCDD under all imposed conditions. This indicates that the global CDD lacks the flexibility required to adapt to scenarios in which the classes are not well separated, as is the case in this simulation design.

Figure 3: Simulation results for Scenario 2. The rows of the panel indicate the specific setup, the columns the number of dimensions, and the three colors indicate the noise levels: low (L), green; medium (M), blue; high (H), red.

To further enrich the comparison between the CDD and the LCDD, the next section presents an application to real datasets. This additional analysis allows us to assess the practical performance of the local depth approach in real-world situations and highlights its potential as an effective and flexible strategy for depth-based classification of directional data.

7 Real data examples

In this section we compare the performance of the proposed classifier with that of its global version by means of two real-world datasets. We ran a 10-fold cross-validation repeated on 10 different training sets (Paindaveine and Van Bever, 2013) in order to select the best value of $\beta$ in the set $\{0.01, 0.05, 0.1, 0.25, 0.5, 1\}$ according to the lowest average misclassification rate (MR).

7.1 Wholesale customers

The first real dataset refers to clients of a wholesale distributor. In marketing applications, the target groups can themselves be composed of different segments of the population, which usually exhibit different spending habits. It can therefore be more efficient to introduce more flexibility when classifying these units through depth functions, allowing for a more local focus. There are a total of 440 observations and 8 variables. The first two are categorical; of these, we are interested in the variable Channel, which can be either Horeca (Hotel/Restaurant/Café) or Retail, and which defines the two classes for this problem. The remaining 6 variables record the annual spending, in monetary units, on diverse product categories: (1) fresh products, (2) milk products, (3) grocery products, (4) frozen products, (5) detergents and paper products, and (6) delicatessen products. We treat these data as compositional, exploiting the square-root transformation so that the points lie on a unit hypersphere. As apparent from Fig. 4, the repeated cross-validation selects $\beta = 0.05$, with an average MR of 0.15, and shows an increasing trend in the average misclassification rate from $\beta = 0.05$ to $\beta = 1$. In this example, the local approach brings an average improvement over the global approach of about 4.5 percentage points.

Figure 4: Repeated 10-fold cross-validation results for the Wholesale dataset. The $x$-axis reports the values of $\beta$, the $y$-axis the cross-validated MR, and the dotted red line highlights the value of $\beta$ achieving the minimum MR.
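Both real-data examples rely on the square-root transformation of Stephens (1982) to map compositions onto the unit hypersphere; in R this amounts to the following two-line helper (our own, for illustration):

```r
# Map compositions (non-negative rows) to the unit sphere: close each row
# so it sums to one, then take square roots; rows then have unit norm.
comp_to_sphere <- function(P) {
  sqrt(P / rowSums(P))
}
```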
7.1 Wholesale customers

The first dataset refers to the clients of a wholesale distributor. In marketing applications, the target groups can themselves be composed of different segments of the population, which usually exhibit different spending habits. It can therefore be more effective to introduce additional flexibility when classifying these units through depth functions, allowing for a more local focus. There are a total of 440 observations and 8 variables. The first two are categorical; of these, we are interested in the variable Channel, which is either Horeca (Hotel/Restaurant/Café) or Retail, and which defines the two classes for this problem. The remaining six variables record the annual spending in monetary units on diverse product categories: (1) fresh products, (2) milk products, (3) grocery products, (4) frozen products, (5) detergents and paper products, and (6) delicatessen products. We treat these data as compositional, exploiting the square-root transformation so that the points lie on a unit hypersphere (this preprocessing is sketched at the end of this section). As apparent from Fig. 4, the repeated cross-validation selects β = 0.05, with an average MR of 0.15, and shows an increasing trend in the average misclassification rate from β = 0.05 to β = 1. In this example, the local approach yields an average improvement over the global approach of about 4.5 percentage points.

Figure 4: Repeated 10-fold cross-validation results for the Wholesale dataset. The x-axis reports the values of β, the y-axis the cross-validated MR, and the dotted red line highlights the value of β achieving the minimum MR.

7.2 SPAM database

The second dataset contains 4601 emails classified as spam or non-spam. What interested us about this textual dataset was both the relevance of the application, since spam detection is an important issue and spam e-mails range from merely annoying to actually dangerous, and the complexity of the data, which are high-dimensional. There are 57 variables in total, of which we select the last one, which contains the class label, and the first 48, which contain the percentage of words in the e-mail matching a given word. Since the percentages of the 48 words chosen to classify the email do not sum to 1, we added one last variable, defined as the complement to 1 of the sum of these percentages. This ensures that the square-root transformation maps each point correctly onto the unit hypersphere; normalizing the data without this extra component would give a very different interpretation of the results. Treating these as compositional data allows us to overcome the biases induced by the different lengths of the emails, which is the usual strategy in text-mining applications (Dhillon and Modha, 2004). Again, Fig. 5 paints a clear picture of the choice of β. Quite interestingly, and unlike anything observed in our simulations, the best-performing β is 0.01, with an average MR of 0.12, possibly due to the larger number of observations. The figure also shows a steadily increasing trend up to β = 0.5, and an average difference between the chosen local depth and the global depth of 8 percentage points.
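To make the preprocessing of both examples concrete, the following sketch appends the complement-to-one column where needed and applies the square-root transformation, so that each observation becomes a unit vector on the hypersphere. It assumes the SPAM word percentages have already been rescaled to proportions summing to at most one; the array names used below are hypothetical.

```python
import numpy as np

def to_sphere(X, add_complement=False):
    """Map nonnegative rows of X to unit vectors via the square-root
    transformation of their compositional representation."""
    X = np.asarray(X, dtype=float)
    if add_complement:
        # Append the complement to 1 of each row sum (rows assumed to sum
        # to at most 1), as done for the 48 word-proportion variables.
        comp = np.clip(1.0 - X.sum(axis=1, keepdims=True), 0.0, None)
        X = np.hstack([X, comp])
    else:
        # Rescale each row so that its parts sum to one (spending data).
        X = X / X.sum(axis=1, keepdims=True)
    # Each row of sqrt(X) has unit norm: sum_j sqrt(p_j)^2 = sum_j p_j = 1.
    return np.sqrt(X)

# Wholesale: normalize the six spending variables, then take square roots.
# Z_wholesale = to_sphere(spending)
# SPAM: add the complement variable first, then take square roots.
# Z_spam = to_sphere(word_props, add_complement=True)
```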
8 Conclusion

This paper introduces the local cosine distance depth (LCDD) for directional data on the hypersphere. The LCDD extends the cosine distance depth to capture local centrality in multimodal distributions through a neighbourhood approach. When applied to the Depth vs. Depth (DD) classifier, the LCDD enables better class separation, even for non-convex class structures. The proposed LCDD-based classifier is compared with its global counterpart in an extensive simulation study. The results demonstrate the effectiveness of the local approach, regardless of the chosen β level, except in settings with high noise and high dimension, where the global and local depth functions produce similar results. These findings are further confirmed by two real-data examples. Future research will focus on extending the approach to multiclass settings and to different manifolds.

Figure 5: Repeated 10-fold cross-validation results for the Spam dataset. The x-axis reports the values of β, the y-axis the cross-validated MR, and the dotted red line highlights the value of β achieving the minimum MR.

References

Agostinelli, C. and Romanazzi, M. (2011). Local depth. Journal of Statistical Planning and Inference, 141(2):817–830.

Agostinelli, C. and Romanazzi, M. (2012). Depth analysis of directional data. In Proceedings of the 46th Scientific Meeting of the Italian Statistical Society (SIS 2012), Rome, Italy. Italian Statistical Society (SIS).

Agostinelli, C. and Romanazzi, M. (2013). Nonparametric analysis of directional data based on data depth. Environmental and Ecological Statistics, 20(2):253–270.

Cuesta-Albertos, J. A., Febrero-Bande, M., and Oviedo de la Fuente, M. (2017). The DD^G-classifier in the functional setting. Test, 26(1):119–142.

Demni, H., Messaoud, A., and Porzio, G. C. (2019). The cosine depth distribution classifier for directional data. In Applications in Statistical Computing: From Music Data Analysis to Industrial Quality Improvement, pages 49–60. Springer.

Dey, S. and Jana, N. (2025). Classification rules for axial data: Parametric and nonparametric approaches. Journal of Classification, pages 1–31.

Dhillon, I. S. and Modha, D. S. (2004). Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143–175.

Guyon, I. M. (1997). A scaling law for the validation-set training-set size ratio.

Hlubinka, D. and Vencalek, O. (2013). Depth-based classification for distributions with nonconvex support. Journal of Probability and Statistics, 2013(1):629184.

Ley, C., Sabbah, C., and Verdebout, T. (2014). A new concept of quantiles for directional data and the angular Mahalanobis depth.

Ley, C. and Verdebout, T. (2017). Modern Directional Statistics. Chapman & Hall/CRC Interdisciplinary Statistics. CRC Press, Boca Raton, FL.

Li, J., Cuesta-Albertos, J. A., and Liu, R. Y. (2012). DD-classifier: Nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association, 107(498):737–753.

Liu, R. Y., Parelius, J. M., and Singh, K. (1999). Multivariate analysis by data depth: descriptive statistics, graphics and inference (with discussion and a rejoinder by Liu and Singh). The Annals of Statistics, 27(3):783–858.

Liu, R. Y. and Singh, K. (1992). Ordering directional data: Concepts of data depth on circles and spheres. The Annals of Statistics, 20(3):1468–1484.

Mardia, K. V. and Jupp, P. E. (1999). Directional Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Chichester.

Paindaveine, D. and Van Bever, G. (2013). From depth to local depth: a focus on centrality. Journal of the American Statistical Association, 108(503):1105–1119.

Pandolfo, G. (2022). The GLD-plot: a depth-based graphical tool to investigate unimodality of directional data. Journal of Statistical Computation and Simulation, 92(11):2372–2385.

Pandolfo, G. and D'Ambrosio, A. (2021). Depth-based classification of directional data. Expert Systems with Applications, 169:114433.

Pandolfo, G., Paindaveine, D., and Porzio, G. C. (2018). Distance-based depths for directional data. Canadian Journal of Statistics, 46(4):593–609.

Pokotylo, O., Mozharovskyi, P., and Dyckerhoff, R. (2019). Depth and depth-based classification with R package ddalpha. Journal of Statistical Software, 91(5):1–46.

Small, C. G. (1987). Measures of centrality for multivariate and directional distributions. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 15(1):31–39.

Stephens, M. A. (1982). Use of the von Mises distribution to analyse continuous proportions. Biometrika, 69(1):197–203.

Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, volume 2, pages 523–531, Vancouver.
