Choice of neighbor order in nearest-neighbor classification


Authors: Peter Hall, Byeong U. Park, Richard J. Samworth

The Annals of Statistics, 2008, Vol. 36, No. 5, 2135–2152. DOI: 10.1214/07-AOS537. © Institute of Mathematical Statistics, 2008.

Australian National University, Seoul National University and University of Cambridge

Received June 2007; revised August 2007. Supported by KOSEF (project R02-2004-000-10040-0).

Abstract. The $k$th-nearest neighbor rule is arguably the simplest and most intuitively appealing nonparametric classification procedure. However, application of this method is inhibited by lack of knowledge about its properties, in particular, about the manner in which it is influenced by the value of $k$; and by the absence of techniques for empirical choice of $k$. In the present paper we detail the way in which the value of $k$ determines the misclassification error. We consider two models, Poisson and Binomial, for the training samples. Under the first model, data are recorded in a Poisson stream and are "assigned" to one or other of the two populations in accordance with the prior probabilities. In particular, the total number of data in both training samples is a Poisson-distributed random variable. Under the Binomial model, however, the total number of data in the training samples is fixed, although again each data value is assigned in a random way. Although the values of risk and regret associated with the Poisson and Binomial models are different, they are asymptotically equivalent to first order, and also to the risks associated with kernel-based classifiers that are tailored to the case of two derivatives. These properties motivate new methods for choosing the value of $k$.

AMS 2000 subject classifications: Primary 62H30; secondary 62G20.

Key words and phrases: Bayes classifier, bootstrap resampling, Edgeworth expansion, error probability, misclassification error, nonparametric classification, Poisson distribution.

1. Introduction. In the classification or discrimination problem with two populations, denoted by $X$ and $Y$, one wishes to classify an observation $z$ to either $X$ or $Y$ using only training data. The $k$th-nearest neighbor classification rule is arguably the simplest and most intuitively appealing nonparametric classifier. It assigns $z$ to population $X$ if at least $\frac12 k$ of the $k$ values in the pooled training-data set nearest to $z$ are from $X$, and to population $Y$ otherwise. The first study of this method was undertaken by Fix and Hodges (1951). Since then there have been many investigations into the method's statistical properties. Little is known about the structure of its error probabilities, however, and neither are formulae available for optimal choice of $k$. Practical methods for optimal empirical choice of $k$ have apparently not been given. The present paper resolves these issues, and focuses on expansions of the error rate of $k$th-nearest neighbor classifiers which are associated with optimal choice of $k$.
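To fix ideas, here is a minimal sketch (ours, not from the paper) of the two-class $k$th-nearest neighbor rule just described; the function name and the toy data are illustrative choices.

```python
import numpy as np

def knn_classify(z, X, Y, k):
    """Assign z to population X if at least k/2 of the k pooled
    training points nearest to z (Euclidean distance) are from X."""
    pooled = np.vstack([X, Y])
    labels = np.array([0] * len(X) + [1] * len(Y))  # 0 = X, 1 = Y
    nearest = labels[np.argsort(np.linalg.norm(pooled - z, axis=1))[:k]]
    return 'X' if np.sum(nearest == 0) >= k / 2 else 'Y'

# Toy example in d = 2, using the Gaussian populations of Section 3.
rng = np.random.default_rng(0)
X = rng.normal(loc=(0.5, -0.5), size=(100, 2))
Y = rng.normal(loc=(-0.5, 0.5), size=(100, 2))
print(knn_classify(np.array([0.4, -0.4]), X, Y, k=11))
```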
We show that the values of risk of nearest-neighbor classifiers can be represented quite simply in terms of properties of the two populations, and that this leads to new, practical ways of choosing the value of $k$.

The sizes of the training samples used to construct classifiers might fairly be viewed as random variables. Consider, for example, the case where a classifier is used by a bank to determine, from the bank's data, whether a new customer is likely to default on a loan. The sizes of the two training samples could be the number, $M$, of previous customers who defaulted, and the number, $N$, who did not default, respectively. An appropriate model for the distributions of $M$ and $N$ might be that they are statistically independent and Poisson, with means $\mu$ and $\nu$, say. For example, the Poisson sample-size model could arise if the population of potential customers were much larger than the number of customers who sought loans from the bank. Thus, Poisson rather than deterministic models for training-sample sizes can be motivated. Here, the total number of data in the two training samples is random, and data in a Poisson stream are "assigned" to one or other of the two populations using a formula which is based on the respective prior probabilities. A different approach, which gives rise to a Binomial-type model, involves the total number of training data being pre-determined, but apportions these data among the two populations in a manner similar to the Poisson model. We shall show that these two approaches produce nearest-neighbor classifiers with risks that are different but are nevertheless first-order equivalent.

For fixed $k$ the risk of a $k$-nearest neighbor classifier converges to its limit relatively quickly, at rate $T^{-2}$, as total sample size, $T$, increases [Cover (1968)]. However, the limiting value is strictly larger than the Bayes risk of the "ideal" classifier that would be used if both population densities were known. By way of comparison, in the case of imperfect information about the population, and in particular, in parametric settings, the risk of empirical Bayes classifiers converges to the Bayes risk no more rapidly than $T^{-1}$; see Kharin and Ducinskas (1979). In nonparametric settings the rate of convergence to Bayes risk is slower still, but may nevertheless be asymptotically optimal; see, for example, Marron (1983) and Mammen and Tsybakov (1999).

In most previous work on nearest-neighbor classifiers, the value of $k$ was held fixed. Cover and Hart (1967) gave upper bounds for the limit of the risk of nearest-neighbor classifiers. Wagner (1971) and Fritz (1975) treated convergence of the conditional error rate when $k = 1$. Devroye and Wagner (1977, 1982) developed and discussed theoretical properties, particularly issues of mathematical consistency, for $k$-nearest-neighbor rules. Devroye (1981) found an asymptotic bound for the regret with respect to the Bayes classifier. Devroye et al. (1994) gave a particularly general description of strong consistency for nearest-neighbor methods. Psaltis, Snapp and Venkatesh (1994) generalized the results of Cover (1968) to general dimension, and Snapp and Venkatesh (1998) further extended the results to the case of multiple classes.
Bax (2000) gave probabilistic bounds for the conditional error rate in the case where $k = 1$. Kulkarni and Posner (1995) addressed nearest-neighbor methods for quite general dependent data, and Holst and Irle (2001) provided formulae for the limit of the error rate in the case of dependent data. Related research includes that of Györfi (1978, 1981) and Györfi and Györfi (1978), who investigated the rate of convergence to the Bayes risk when $k$ tends to infinity as $T$ increases.

In the case of classifiers based on second-order kernel density estimators, and for populations with twice-differentiable densities, the risk typically converges to the Bayes risk at rate $n^{-4/(d+4)}$, where $d$ denotes the number of dimensions. See, for example, Kharin (1982), Raudys and Young (2004) and Hall and Kang (2005). In a minimax sense that Marron (1983) makes precise, this rate is optimal. As we show in this paper, nearest-neighbor classifiers with Poisson or Binomial interpretations of sample size have the same property. Recent work on properties of classifiers focuses largely on deriving upper and lower bounds to regret in cases where the classification problem is relatively difficult, for example, where the classification boundary is comparatively unsmooth. Research of Audibert and Tsybakov (2007) and Kohler and Krzyżak (2006), for example, is in this category. The work of Mammen and Tsybakov (1999), which permits the smoothness of a classification problem to be varied in the continuum, forms something of a bridge between the smooth case, which we treat, and the rough case.

There is a literature on empirical choice of $k$; see, for example, Chapter 26 of Devroye, Györfi and Lugosi (1996) and Sections 7.2 and 8.4 of Györfi et al. (2002). More generally, Devroye, Györfi and Lugosi (1996) explored the properties and features of nearest-neighbor methods in the setting of pattern recognition. Chapter 5 of that monograph gives a good guide to the literature in this setting.

2. Main results.

2.1. Different interpretations of sample size. Assume we have $m$ identically distributed data $\mathcal{X} = \{X_1, \ldots, X_m\}$, and $n$ identically distributed data $\mathcal{Y} = \{Y_1, \ldots, Y_n\}$, all of them $d$-variate and mutually independent. Let the respective probability densities be $f$ and $g$. Given a compact set $R \subseteq \mathbb{R}^d$, we wish to use the data to classify a new datum $z \in R$ as coming from the $X$ or $Y$ population. Note that we do not assume $f$ and $g$ themselves to be compactly supported; the constraint is only that we confine attention to the problem of classifying new data that come from a given compact region $R$.

In many instances the ratio of the sizes of the datasets is a good approximation to the ratio of the prior probabilities of observing the respective populations. We shall adopt this viewpoint, which raises the issue of how we should interpret $m$ and $n$. Two models arise in a natural way: the Poisson, where the individual sample sizes are Poisson-distributed and data are assigned randomly to one population or another, in proportion to the respective likelihoods; and the Binomial, where the sum of the two training-sample sizes is deterministic but data are ascribed to populations in the same fashion as before.
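To make the two sampling models concrete, here is a small sketch (ours; the parameter values and densities are illustrative). Under the Poisson model the two sample sizes are independent Poisson variables; under the Binomial model the total is fixed and split binomially according to the prior probabilities, which is equivalent to assigning marks with probability $\psi$ as described below.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, nu, d = 100.0, 200.0, 2

def draw_f(n):  # X-population density f: here N_2((0.5, -0.5), I_2)
    return rng.normal(loc=(0.5, -0.5), size=(n, d))

def draw_g(n):  # Y-population density g: here N_2((-0.5, 0.5), I_2)
    return rng.normal(loc=(-0.5, 0.5), size=(n, d))

# Poisson model: independent Poisson sample sizes with means mu and nu.
M, N = rng.poisson(mu), rng.poisson(nu)
X_pois, Y_pois = draw_f(M), draw_g(N)

# Binomial model: fixed total T, apportioned with probability mu/(mu + nu).
T = int(mu + nu)
M_bin = rng.binomial(T, mu / (mu + nu))
X_bin, Y_bin = draw_f(M_bin), draw_g(T - M_bin)
```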
The Poisson case can be viewed as the result of taking a sample from a marked point process in $\mathbb{R}^d$, and assigning marks in a way that reflects prior probabilities; and the Binomial case is the result of conditioning on total sample size in the Poisson setting. In the sense that it avoids the conditioning step, the Poisson case is the more natural and has the greater degree of symmetry. Therefore, we take that as the basis for analysis, and tackle the Binomial model by reference to the solution in the Poisson case.

In multi-population cases, the $k$th nearest-neighbor classifier would typically be used to assign $z$ to population $j$ if that population accounted for the greatest number of data among the $k$ values in the pooled dataset that are nearest to $z$. Our results apply directly to this case, provided we work within a compact region at each point of which the maximum value of the population densities is achieved by no more than two densities. Another straightforward extension is to the case where distance is measured in a weighted Euclidean metric; we shall work only with the standard, unweighted form.

2.2. Poisson model. Assume that $\mathcal{X} = \{X_1, X_2, \ldots\}$ and $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ represent points of type $X$ and type $Y$, respectively, in a two-type marked Poisson process, $P$, in $\mathbb{R}^d$, with intensity function $\mu f + \nu g$, and respective probabilities

$$\psi(z) = \frac{\mu f(z)}{\mu f(z) + \nu g(z)} \qquad (2.1)$$

and $1 - \psi(z)$ that a point of $P$ at $z$ is of type $X$ or of type $Y$. In particular, the respective prior probabilities of the $X$ and $Y$ populations are $\mu/(\mu+\nu)$ and $\nu/(\mu+\nu)$. It will be assumed that $f$ and $g$ are held fixed, and that $\mu$ and $\nu$ satisfy:

$$\mu = \mu(\nu) \text{ increases with } \nu, \text{ in such a manner that } \mu/(\mu+\nu) \to p \in (0,1) \text{ as } \nu \to \infty. \qquad (2.2)$$

Define $\rho = pf/\{pf + (1-p)g\}$, a function on $\mathbb{R}^d$. Suppose too that the respective densities, $f$ and $g$, of the $X$ and $Y$ populations, satisfy:

the set $S \subseteq R$, defined as the locus of points $z$ for which $\rho(z) = \frac12$, is of codimension 1 and of finite measure in $d-1$ dimensions; $\qquad (2.3)$

the distributions with densities $f$ and $g$ have finite second moments; $f$ and $g$ are both continuous in an open set containing $R$, and both have two continuous derivatives within an open set containing $S$; and $f + g > 0$ on $R$. $\qquad (2.4)$

The first part of (2.3) asks that $S$ be a $(d-1)$-dimensional structure—a set of isolated points if $d = 1$, a set of curves in the plane if $d = 2$, and so on. The assumption of two derivatives in (2.4) is to be expected since, as noted in Section 1, the convergence rate of regret that is achieved by nearest-neighbor methods is optimal under that smoothness assumption. The condition that the derivatives assumed in (2.4) are continuous is imposed only so that a concise asymptotic formula for regret can be given; see (2.8) below. Without the precision provided by the continuity assumption, we could state only an upper bound for regret, in which the right-hand side of (2.8) was replaced by $O\{k^{-1} + (k/\nu)^{4/d}\}$.

We ask too that the slopes at which the two densities, weighted in proportion to their prior probabilities, meet along $S$ be bounded away from zero along $S$.
That is, the function

$$a(z)^2 \equiv \sum_{j=1}^d \left\{ p\,\frac{\partial f(z)}{\partial z_j} - (1-p)\,\frac{\partial g(z)}{\partial z_j} \right\}^2 \text{ is bounded away from zero on } S. \qquad (2.5)$$

Equivalently, the prior-weighted densities cross at an angle, rather than meet in a tangential way. If the prior-weighted densities were to have exactly equal gradients at crossing places, then there would be an explicit and intimate connection between the distributions of the $X$ and $Y$ populations that could hardly arise by chance. It is difficult to envisage that perfect alignment of densities at crossing points would actually occur commonly in practice.

Write $dz_0$ for an infinitesimal element of $S$, centred at $z_0$. Let $a_d = \pi^{d/2}/\Gamma(1+\frac12 d)$ denote the content of the unit $d$-dimensional sphere, define $\lambda = p(1-p)^{-1} f + g$ and

$$\alpha(z) = \frac{d}{d+2}\,\lambda(z)^{-1-(2/d)}\,a_d^{-2/d}\,d^{-1} \sum_{j=1}^d \left\{ \rho_j(z)\lambda_j(z) + \tfrac12 \rho_{jj}(z)\lambda(z) \right\}, \qquad (2.6)$$

where $z = (z^{(1)}, \ldots, z^{(d)})$, $\lambda_j(z)$ and $\rho_j(z)$ denote the first derivatives of the respective functions with respect to $z^{(j)}$, and $\rho_{jj}(z)$ is the second derivative of $\rho(z)$ with respect to $z^{(j)}$. Put $\dot\rho = (\rho_1, \ldots, \rho_d)$.

Let $\Phi$ denote the cumulative distribution function of the standard normal distribution and let $\Psi_1(z) = \|\dot\rho(z)\|^{-1}\,\{p\dot f(z) - (1-p)\dot g(z)\}^T \dot\rho(z)$. It can be shown that, on $S$, $a(z) = \Psi_1(z) = 4h(z)\|\dot\rho(z)\|$, where $h(z)$ denotes the common value that $pf(z)$ and $(1-p)g(z)$ assume at $z \in S$. Therefore, since assumptions (2.3)–(2.5) imply that $a$ and $h$ are bounded away from zero and infinity on $S$, they also ensure that $\Psi_1(z)$ and $\|\dot\rho(z)\|$ are bounded away from zero and infinity there. It follows that the constants $C_1$ and $C_2$, given by

$$C_1 = \int_S \Psi_1(z_0)\,(8\|\dot\rho(z_0)\|^2)^{-1}\,dz_0 = \frac12 \int_S \frac{h(z_0)}{\|\dot\rho(z_0)\|}\,dz_0, \qquad (2.7)$$
$$C_2 = 4\int_S \Psi_1(z_0)\,(8\|\dot\rho(z_0)\|^2)^{-1}\,\alpha(z_0)^2\,dz_0 = 2\int_S \frac{h(z_0)}{\|\dot\rho(z_0)\|}\,\alpha(z_0)^2\,dz_0,$$

are finite, that $C_1$ is nonzero, and that $C_2 = 0$ if and only if $\alpha$ is identically zero on $S$.

The Bayes classifier assigns $z$ to the $X$ or $Y$ population according as $\psi(z) \ge \frac12$ or $\psi(z) < \frac12$, respectively. Therefore, the Bayes risk for classification on $R$ is

$$\mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}} = \int_R \min\left\{ \frac{\mu f}{\mu+\nu},\; \frac{\nu g}{\mu+\nu} \right\},$$

where, here and below, the superscript "Pois" will indicate that the setting of the Poisson model is being considered. The risk of the $k$-nearest neighbor classifier, which assigns $z$ to population $X$ if at least $\frac12 k$ of the $k$ values of Poisson data nearest to $z$ are from $X$, and to population $Y$ otherwise, is

$$\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}} = \frac{\mu}{\mu+\nu} \int_R f(z)\,P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } Y)\,dz + \frac{\nu}{\mu+\nu} \int_R g(z)\,P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X)\,dz.$$

A proof of the following result is given in Section 4.

Theorem 1. Assume the Poisson model, that (2.2)–(2.5) hold, and that $1 \le k_1(\nu) < k_2(\nu)$, where $k_1(\nu)/\nu^\varepsilon \to \infty$ and $k_2(\nu) = O(\nu^{1-\varepsilon})$ for some $0 < \varepsilon < 1$. Then,

$$\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}} - \mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}} = C_1 k^{-1} + C_2 (k/\nu)^{4/d} + o\{k^{-1} + (k/\nu)^{4/d}\}, \qquad (2.8)$$

uniformly in $k_1(\nu) \le k \le k_2(\nu)$.

Result (2.8) implies that, provided $\alpha$ is not identically zero, the optimal $k$ satisfies $k^{\mathrm{Pois}}_{\mathrm{opt}} \sim \mathrm{const.}\,\nu^{4/(d+4)}$.
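Minimizing the two leading terms of (2.8) in $k$ gives $k_{\mathrm{opt}} = \{dC_1/(4C_2)\}^{d/(d+4)}\nu^{4/(d+4)}$, consistent with the rate just stated. The sketch below (ours) evaluates $C_1$ and $C_2$ of (2.7) by finite differences for $d = 1$, $p = 1/3$ and the normal populations used in Section 3, where $S$ is the single point $z_0 = -\log 2$; since these are asymptotic constants, the resulting $k_{\mathrm{opt}}$ need not match the finite-sample optimum reported in Table 1.

```python
import numpy as np

# d = 1, p = 1/3: f = N(-0.5, 1), g = N(0.5, 1), so rho(z) = 1/(1 + 2*exp(z))
# and S = {z0 : rho(z0) = 1/2} = {-log 2}.
phi = lambda x, m: np.exp(-(x - m) ** 2 / 2) / np.sqrt(2 * np.pi)
f = lambda z: phi(z, -0.5)
g = lambda z: phi(z, 0.5)
p = 1 / 3
rho = lambda z: p * f(z) / (p * f(z) + (1 - p) * g(z))
lam = lambda z: p / (1 - p) * f(z) + g(z)

d1 = lambda F, z, h=1e-5: (F(z + h) - F(z - h)) / (2 * h)          # F'(z)
d2 = lambda F, z, h=1e-4: (F(z + h) - 2 * F(z) + F(z - h)) / h**2  # F''(z)

z0 = -np.log(2)
d, a_d = 1, 2.0                     # a_1 = 2: content of the unit ball in d = 1
h0 = p * f(z0)                      # h(z0): common value of p*f and (1-p)*g on S
rp = d1(rho, z0)                    # rho'(z0) (equals -1/4 here)
alpha = (d / (d + 2)) * lam(z0) ** (-1 - 2 / d) * a_d ** (-2 / d) / d * (
    rp * d1(lam, z0) + 0.5 * d2(rho, z0) * lam(z0))   # alpha(z0), as in (2.6)

C1 = 0.5 * h0 / abs(rp)              # first constant in (2.7)
C2 = 2.0 * h0 / abs(rp) * alpha**2   # second constant in (2.7)

nu = 200.0
k = np.arange(1, 400)
regret = C1 / k + C2 * (k / nu) ** (4 / d)   # leading terms of (2.8)
print(k[np.argmin(regret)],                  # grid minimizer ...
      (d * C1 / (4 * C2)) ** (d / (d + 4)) * nu ** (4 / (d + 4)))  # ... vs closed form
```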
To set (2.8) into context, we note that a general formula for the difference between the risk of an empirical classifier and the Bayes risk can be developed from the theory of "plug-in decisions"; see Theorem 2.2, page 16, of Devroye, Györfi and Lugosi (1996), and Theorem 6.2, page 93, of Györfi et al. (2002). When specialized to the case of nearest-neighbor methods, this argument bounds the left-hand side of (2.8) by a constant multiple of $\{k^{-1} + (k/\nu)^{2/d}\}^{1/2}$, the minimum order of which is $\nu^{-1/(d+2)}$. Mammen and Tsybakov (1999) showed that, in the case where discrimination boundaries are smooth, substantially faster convergence rates are possible. Result (2.8) and its analogues in the setting of Theorem 2 give concise accounts of those faster rates in the case of nearest-neighbor methods. Expansion (2.8) has a close analogue in the setting of second-order, kernel-based methods. See, for example, formulae (3) of Kharin (1982) and (A.2) of Hall and Kang (2005).

2.3. Binomial model. In the Poisson model we can think of the data as arriving in a stream $(Z_1, L_1), (Z_2, L_2), \ldots$, where $Z_1, Z_2, \ldots$ comprise a Poisson process in $\mathbb{R}^d$, with intensity function $\mu f + \nu g$, and the "labels" $L_i$ form a sequence of zeros and ones, independent of one another conditional on the $Z_i$'s, with $P(L_i = 0 \mid Z_i) = \psi(Z_i)$ and $\psi$ defined by (2.1). If $L_i = 0$, then $Z_i$ is labeled as coming from the $X$-population, whereas if $L_i = 1$, then $Z_i$ is labeled as $Y$. Since the integral of the Poisson-process intensity over $\mathbb{R}^d$ equals $\mu + \nu$, the number of points $Z_i$ is a Poisson-distributed random variable, $T$, say, with mean $\mu + \nu$.

In the Binomial model we use the same process to generate data, but now we condition on $T$. It is convenient to think of $T$ as $m + n$, where $m = \mu T/(\mu+\nu)$ and $n = \nu T/(\mu+\nu)$ are the respective average numbers of points that would occur in the two training samples if we were to adopt the procedure indicated above. (In particular, $m$ and $n$ are not necessarily integers.) In this notation the risk for the nearest-neighbor classifier under the Binomial model can be written as

$$\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}} = \frac{m}{T} \int_R f(z)\,P^{\mathrm{Bin}}(z \text{ classified by } k\text{-nn rule as type } Y)\,dz + \frac{n}{T} \int_R g(z)\,P^{\mathrm{Bin}}(z \text{ classified by } k\text{-nn rule as type } X)\,dz,$$

where we use the superscript Bin to indicate that we are sampling under the Binomial model. If we suppose that

$$\mu + \nu = T, \text{ a nonrandom integer}, \qquad (2.9)$$

then these complications do not arise, and so we shall assume (2.9) below. This condition also implies that the Bayes risk under the Binomial model, $\mathrm{risk}^{\mathrm{Bin}}_{\mathrm{Bayes}}$, is identical to its counterpart under the Poisson model, and that helps to further simplify comparisons.

Theorem 2. Assume the Binomial model, that (2.2)–(2.5) and (2.9) hold, and that $k_1$ and $k_2$ satisfy the conditions imposed on them in Theorem 1. Then,

$$\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}} - \mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}} = o\{k^{-1} + (k/\nu)^{4/d}\}, \qquad (2.10)$$

uniformly in $k_1(\nu) \le k \le k_2(\nu)$.

A proof of Theorem 2 is given in a longer version of this paper [Hall, Park and Samworth (2007)].
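Theorem 2 compares the two models, which are linked by conditioning: given the total count $T$, the type-$X$ count of the Poisson model is Binomial$(T, \mu/(\mu+\nu))$. A quick Monte Carlo check of this conditioning property (our sketch; sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, nu = 100.0, 200.0
T = int(mu + nu)   # condition (2.9): mu + nu = T, a nonrandom integer

# Draw many independent pairs (M, N) from the Poisson model, keep those
# with M + N = T, and compare the conditional mean of M with T * mu/(mu + nu).
M = rng.poisson(mu, size=500_000)
N = rng.poisson(nu, size=500_000)
cond = M[M + N == T]
print(cond.mean(), T * mu / (mu + nu))   # the two should agree closely (~100)
```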
Formula (2.10) asserts that the difference between $\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}}$ and $\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}}$ is of smaller order than the difference between $\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}}$ and $\mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}}$ [see (2.8) for the latter difference], and hence, implies that the expansion of regret at (2.8) is equally valid if $\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}}$ and $\mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}}$ there are replaced by $\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}}$ and $\mathrm{risk}^{\mathrm{Bin}}_{\mathrm{Bayes}}$, respectively.

2.4. Empirical choice of $k_{\mathrm{opt}}$. The theoretical results described earlier can be used to motivate practical methods for choosing $k$. We shall treat the Poisson model; the Binomial model can be addressed similarly.

Let $M$ and $N$ be the respective sizes of the training samples $\mathcal{X}$ and $\mathcal{Y}$. Generate $M^*$ and $N^*$, respectively, from the Poisson distributions with means equal to $M$ and $N$. Let $0 < r < 1$. Draw bootstrap resamples $\mathcal{X}^*$ and $\mathcal{Y}^*$, of respective sizes $M_1^* = [rM^*]$ and $N_1^* = [rN^*]$, from $\mathcal{X}$ and $\mathcal{Y}$. Here, $[x]$ denotes the integer part of $x$. This choice of $M_1^*$ and $N_1^*$ implies that the total resample size equals $r(M^* + N^*)$, except for rounding errors arising from taking integer parts. Note too that $M_1^*/(M_1^* + N_1^*)$ equals the sampling fraction $M^*/(M^* + N^*)$ (again modulo integer-part rounding). This is necessary if our bootstrap algorithm, based on repeated resamples of sizes $M_1^*$ and $N_1^*$, is to mimic properties of the original sampling algorithm.

Draw additional resamples $\mathcal{X}^*_{\mathrm{test}}$ and $\mathcal{Y}^*_{\mathrm{test}}$, of respective sizes $M^* - M_1^*$ and $N^* - N_1^*$, from $\mathcal{X}$ and $\mathcal{Y}$. Build near-neighbor classifiers based on $\mathcal{X}^*$ and $\mathcal{Y}^*$. Use them to classify the data $\mathcal{X}^*_{\mathrm{test}}$ and $\mathcal{Y}^*_{\mathrm{test}}$, and compute the resulting error rate. Average this rate over a large number of choices of $\{\mathcal{X}^*, \mathcal{X}^*_{\mathrm{test}}\}$ and $\{\mathcal{Y}^*, \mathcal{Y}^*_{\mathrm{test}}\}$. Choose $k = \hat k_{\mathrm{opt}}$ to minimize the average error rate; it is an estimator of the value of $k_{\mathrm{opt}}(r\mu, r\nu)$ that we would use if the true intensity function were $r(\mu f + \nu g)$, rather than $\mu f + \nu g$. Convert $\hat k_{\mathrm{opt}}$ to an empirical value, $\tilde k_{\mathrm{opt}} = r^{-4/(d+4)}\,\hat k_{\mathrm{opt}}$, that is of the right size for classification starting from the samples $\mathcal{X}$ and $\mathcal{Y}$.

In the case of the binomial sample-size model, one may follow the same bootstrapping procedure as in the Poisson case, but generating $M^*$ from Binomial$(M+N, M/(M+N))$ and taking $N^* = M + N - M^*$.
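The following is a compact sketch (ours) of the procedure just described, for the Poisson case; the number of resamples $B$, the grid of candidate $k$ and the helper function are illustrative choices not fixed by the paper.

```python
import numpy as np

def knn_error(Xtr, Ytr, Xte, Yte, k):
    """Misclassification rate of the k-nn rule built on (Xtr, Ytr)."""
    pooled = np.vstack([Xtr, Ytr])
    labels = np.r_[np.zeros(len(Xtr)), np.ones(len(Ytr))]  # 0 = X, 1 = Y
    errs = 0
    for pts, truth in ((Xte, 0), (Yte, 1)):
        for z in pts:
            near = labels[np.argsort(np.linalg.norm(pooled - z, axis=1))[:k]]
            errs += (0 if np.sum(near == 0) >= k / 2 else 1) != truth
    return errs / (len(Xte) + len(Yte))

def choose_k(X, Y, r=1/3, B=100, k_grid=tuple(range(1, 61, 2)), seed=0):
    """Bootstrap choice of k under the Poisson sample-size model (Section 2.4).
    X and Y are arrays of shape (M, d) and (N, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    avg = np.zeros(len(k_grid))
    for _ in range(B):
        Ms, Ns = rng.poisson(len(X)), rng.poisson(len(Y))   # M*, N*
        M1, N1 = int(r * Ms), int(r * Ns)                   # [rM*], [rN*]
        Xb = X[rng.integers(0, len(X), M1)]                 # resample X*
        Yb = Y[rng.integers(0, len(Y), N1)]                 # resample Y*
        Xt = X[rng.integers(0, len(X), Ms - M1)]            # X*_test
        Yt = Y[rng.integers(0, len(Y), Ns - N1)]            # Y*_test
        for i, k in enumerate(k_grid):
            avg[i] += knn_error(Xb, Yb, Xt, Yt, k)
    k_hat = k_grid[int(np.argmin(avg))]                     # hat-k_opt
    return int(round(r ** (-4 / (d + 4)) * k_hat))          # tilde-k_opt
```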
3. Numerical properties. We present the results of a numerical experiment demonstrating the effectiveness of the empirical choice $\tilde k_{\mathrm{opt}}$ introduced in Section 2. We simulated 500 training datasets from Poisson sample-size models for selected pairs of intensity constants $(\mu, \nu)$. Each dataset was obtained as follows. First, we generated a random number, say $N$, from a Poisson distribution with mean $\mu + \nu$. Then, we drew $N$ independent data from the density $\lambda(z) = \{\mu f(z) + \nu g(z)\}/(\mu + \nu)$; let these be $Z_1, \ldots, Z_N$. For $i = 1, \ldots, N$, we marked "type $X$" or "type $Y$" on $Z_i$ with respective probabilities $\psi(Z_i) = \mu f(Z_i)/\{\mu f(Z_i) + \nu g(Z_i)\}$ and $1 - \psi(Z_i)$. An equivalent way of doing this would be to draw $N$ independent data, each of which is sampled from the density $f$ or $g$ with respective probabilities $\mu/(\mu+\nu)$ and $\nu/(\mu+\nu)$. Each datum would then be marked "type $X$" if it was from $f$, and "type $Y$" otherwise.

We took $(\mu, \nu) = (100, 100)$ and $(100, 200)$ and considered the cases $d = 1, 2$. For $d = 1$, we chose $f$ to be the density function of $N(-0.5, 1)$ and $g$ to be the density function of $N(0.5, 1)$. For $d = 2$, we considered two pairs of densities. One was $(f, g)$, where $f \sim N_2((0.5, -0.5), I_2)$ and $g \sim N_2((-0.5, 0.5), I_2)$. Here, $I_d$ is the $d \times d$ identity matrix. The other was a pair of bivariate normal densities, as in the first case but with correlation $\rho = 0.5$. For each $z$, we evaluated

$$\widehat P^{\mathrm{Pois}}(z \text{ classified as type } X) = \frac{1}{500}\,(\#\text{ training samples that classify } z \text{ as type } X).$$

The error rate was then estimated by the formula

$$\widehat{\mathrm{Err}} = \frac{\mu}{\mu+\nu} \int_R f(z)\,\{1 - \widehat P^{\mathrm{Pois}}(z \text{ classified as type } X)\}\,dz + \frac{\nu}{\mu+\nu} \int_R g(z)\,\widehat P^{\mathrm{Pois}}(z \text{ classified as type } X)\,dz.$$

We took $R = [-2.5, 2.5]^d$, which covered most of the sampling region. To see the effect of the bootstrap resampling fraction on the performance of $\tilde k_{\mathrm{opt}}$, the three choices $r = 1/3, 1/2, 2/3$ were considered, where $r$ was defined in Section 2.4. For computation of $\tilde k_{\mathrm{opt}}$, 100 bootstrap resamples were drawn.

Table 1 shows the estimated error rates of the $k$-nearest neighbor classifier with $k_{\mathrm{opt}}$, and the $k$-nearest neighbor classifier with $\tilde k_{\mathrm{opt}}$, for each simulation setting. Here, $k_{\mathrm{opt}}$ denotes the value of the deterministic $k$ that minimized the estimated error rate of the $k$-nn classifier. The Monte Carlo sampling variability of the estimated error rates can be measured by

$$\mathrm{s.e.}(\widehat{\mathrm{Err}}) = \sqrt{\widehat{\mathrm{Err}}\,(1 - \widehat{\mathrm{Err}})/500}.$$

Table 1. Error rates of classifiers based on 500 training datasets from Poisson sample-size models with intensity $\lambda = \mu f + \nu g$, where $f$ and $g$ are densities of normal distributions as specified in the text. Here, $r$ denotes the subsampling fraction that appears in Section 2.4, and $\rho$ the correlation in the bivariate case.

                                          k-nn with    k-nn with k~_opt
  d   (mu, nu)     rho   Bayes   k_opt      k_opt     r = 1/3  r = 1/2  r = 2/3
  1   (100, 100)    -    0.3072   103      0.3119     0.3119   0.3118   0.3120
      (100, 200)    -    0.2685    61      0.2735     0.2759   0.2784   0.2814
  2   (100, 100)    0    0.2371    71      0.2444     0.2445   0.2450   0.2454
                   0.5   0.1566    39      0.1654     0.1682   0.1708   0.1731
      (100, 200)    0    0.2125    45      0.2199     0.2236   0.2274   0.2310
                   0.5   0.1430    27      0.1514     0.1684   0.1784   0.1870

It is seen that the empirical choice $\tilde k_{\mathrm{opt}}$ works particularly well. The error rates of the $k$-nn classifier with $\tilde k_{\mathrm{opt}}$ are not far from the error rate of the corresponding classifier with $k_{\mathrm{opt}}$. The interval $\widehat{\mathrm{Err}} \pm \mathrm{s.e.}(\widehat{\mathrm{Err}})$, where $\widehat{\mathrm{Err}}$ is the estimated error rate of the $k$-nn classifier with $\tilde k_{\mathrm{opt}}$, contains the optimal error rate achieved by the corresponding classifier with $k_{\mathrm{opt}}$, except in the correlated case with $(\mu, \nu) = (100, 200)$. For the latter case, confidence intervals with two standard errors include the corresponding optimal value. Overall, the subsampling fraction $r = 1/3$ gave the best results. However, the error rate does not change much for different choices of $r$; the differences are not statistically significant. This suggests that $\tilde k_{\mathrm{opt}}$ may not be sensitive to the choice of the resampling fraction.

In the simulations we tried other populations with different mean vectors and covariance matrices. Also, we tried other training sample sizes. The lessons that we learned from the other simulation settings were basically the same as those obtained from Table 1. Table 1 also suggests that the optimal choice $k_{\mathrm{opt}}$ for the case $\mu \ne \nu$ tends to be smaller than the one for $\mu = \nu$.

Our theory for the rate of $k_{\mathrm{opt}}$ was also evident empirically. For example, we found that $k_{\mathrm{opt}}$ changed from 27 to 71 when $(\mu, \nu)$ increased from $(100, 200)$ to $(400, 800)$ in the case corresponding to the bottom row of Table 1. The rate of increase in this case was $71/27 = 2.63$, which was roughly consistent with the theoretical value $4^{4/(2+4)} = 2.52$.
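For concreteness, here is a scaled-down sketch (ours) of the Monte Carlo estimate $\widehat{\mathrm{Err}}$ for the $d = 1$ setting, with fewer training datasets than the 500 used in the paper, a simple Riemann sum over $R = [-2.5, 2.5]$, and $k$ held fixed rather than chosen by the bootstrap.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, nu, k, n_sets = 100.0, 200.0, 25, 100
phi = lambda x, m: np.exp(-(x - m) ** 2 / 2) / np.sqrt(2 * np.pi)
f = lambda z: phi(z, -0.5)   # d = 1 populations of Section 3
g = lambda z: phi(z, 0.5)

grid = np.linspace(-2.5, 2.5, 201)          # points of R at which to classify
votes_X = np.zeros_like(grid)
for _ in range(n_sets):
    N = rng.poisson(mu + nu)                # Poisson sample-size model
    is_X = rng.random(N) < mu / (mu + nu)   # equivalent mixture construction
    Z = np.where(is_X, rng.normal(-0.5, 1, N), rng.normal(0.5, 1, N))
    for j, z in enumerate(grid):
        near = is_X[np.argsort(np.abs(Z - z))[:k]]
        votes_X[j] += near.sum() >= k / 2
P_X = votes_X / n_sets      # estimate of P^Pois(z classified as type X)

w = grid[1] - grid[0]       # Riemann-sum weight
err = w * (mu / (mu + nu) * np.sum(f(grid) * (1 - P_X)) +
           nu / (mu + nu) * np.sum(g(grid) * P_X))
print(err)                  # compare with the d = 1 rows of Table 1
```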
To obtain similar empirical evidence in higher-dimensional feature spaces, we considered a case with $d = 16$. We simulated 500 training datasets from Poisson sample-size models, with $f$ and $g$ being the densities of $N_{16}((0.25, \ldots, 0.25), I_{16})$ and $N_{16}((-0.25, \ldots, -0.25), I_{16})$, respectively, when $(\mu, \nu) = (100, 200)$ and $(10000, 20000)$. The relative increase of $k_{\mathrm{opt}}$ in this case was $61/25 = 2.44$, which is not far from its theoretical value $100^{4/(16+4)} = 2.51$.

4. Proof of Theorem 1. Let $S_\varepsilon$ denote the set of points in $R$ that are distant no further than $\varepsilon > 0$ from $S$. Write $R \setminus S_\varepsilon$ for the set of points in $R$ that are not in $S_\varepsilon$. Using Markov's inequality, it can be shown that, for each fixed $C, \varepsilon > 0$, we have, as $\nu \to \infty$,

$$P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X) = I\{\psi(z) > \tfrac12\} + O(\nu^{-C}), \qquad (4.1)$$

uniformly in $z \in R \setminus S_\varepsilon$. By letting $\varepsilon = \varepsilon(\nu)$ converge to zero sufficiently slowly in (4.1), we ensure that that result remains true for decreasingly small $\varepsilon$. We need (4.1) only when $C = 1$, and for $\varepsilon(\nu)$ decreasing sufficiently slowly to zero. This version of (4.1) implies that

$$\int_{R \setminus S_\varepsilon} g(z)\,P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X)\,dz = \int_{R \setminus S_\varepsilon} g(z)\,I\{\psi(z) > \tfrac12\}\,dz + O(\nu^{-1}). \qquad (4.2)$$

In view of (4.1) and (4.2), properties of $f$ and $g$ away from $S_\varepsilon$ do not affect the size of regret up to any polynomial order. Hence, there is no loss of generality in working with distributions for which $f$ and $g$ have two continuous derivatives on $R_\varepsilon$, rather than simply on $S_\varepsilon$. This simplifies notation, and so we shall make the assumption below without further comment.

Given $z \in R$, let $Z_{(1)}, Z_{(2)}, \ldots$ denote the point locations of the marked point process $P$, ordered such that $\|z - Z_{(1)}\| \le \|z - Z_{(2)}\| \le \cdots$; let $z_{(1)}, z_{(2)}, \ldots$ represent particular values of $Z_{(1)}, Z_{(2)}, \ldots$, respectively; and put $\vec z = (z_{(1)}, \ldots, z_{(k)})$ and $\vec Z = (Z_{(1)}, \ldots, Z_{(k)})$. Denote by $\Pi^{\mathrm{Pois}}(\vec z, k)$ the probability, conditional on $\vec Z = \vec z$, that among the points $z_{(1)}, \ldots, z_{(k)}$ there are at least $\frac12 k$ points with marks $X$. We may write

$$\Pi^{\mathrm{Pois}}(\vec z, k) = P\left( \sum_{i=1}^k J_i \ge \tfrac12 k \right),$$

where $J_1, \ldots, J_k$ are independent zero-one variables,

$$P(J_i = 1) = q_i \equiv \psi(z_{(i)}), \qquad \psi = \frac{\mu f}{\mu f + \nu g}. \qquad (4.3)$$

To aid interpretation of (4.3), note that, since we are here conditioning on $\vec Z = \vec z$, $P(J_i = 1) = P(J_i = 1 \mid Z_{(i)} = z_{(i)})$.
Note that, uniformly in $1 \le i \le k \in [k_1(\nu), k_2(\nu)]$,

$$E\{\psi(Z_{(i)})\} = \psi(z) + \sum_{j=1}^d E(Z_{(i)} - z)^{(j)}\,\psi_j(z) + \frac12 \sum_{j_1=1}^d \sum_{j_2=1}^d E\{(Z_{(i)} - z)^{(j_1)} (Z_{(i)} - z)^{(j_2)}\}\,\psi_{j_1 j_2}(z) + o(E\|Z_{(k)} - z\|^2), \qquad (4.4)$$

where $(Z_{(i)} - z)^{(j)}$ denotes the $j$th component of $Z_{(i)} - z$, $\psi_j(z) = (\partial/\partial z^{(j)})\psi(z)$ and $\psi_{j_1 j_2}(z) = (\partial^2/\partial z^{(j_1)}\partial z^{(j_2)})\psi(z)$. To obtain (4.4), we have used (2.4), which implies that, for sufficiently small $\varepsilon > 0$, $f$ and $g$ have two continuous derivatives on $R_\varepsilon$, the latter denoting the set of all points in $\mathbb{R}^d$ that are distant no further than $\varepsilon$ from some point in $R$. It follows from this result that, under (2.4), the probability that $Z_{(i)} = Z_{(i)}(z) \in R_\varepsilon$, for all $1 \le i \le k_2(\nu)$ and all $z \in R$, equals $1 - O(\nu^{-C})$ for all $C > 0$. This implies the Taylor expansion of $\psi(Z_{(i)})$ that leads to (4.4), and, in combination with the moment condition in (2.4), ensures the correctness of the remainder term in (4.4).

Under the conditions of Theorem 1, $E\|Z_{(k)} - z\|^2 = O\{(k/\nu)^{2/d}\}$, and so (4.4) implies that

$$\sum_{i=1}^k \{E\psi(Z_{(i)}) - \psi(z)\} = \sum_{i=1}^k \sum_{j=1}^d E(Z_{(i)} - z)^{(j)}\,\psi_j(z) + \frac12 \sum_{i=1}^k \sum_{j_1=1}^d \sum_{j_2=1}^d E\{(Z_{(i)} - z)(Z_{(i)} - z)^T\}_{j_1 j_2}\,\psi_{j_1 j_2}(z) + o\{k(k/\nu)^{2/d}\}. \qquad (4.5)$$

Since $R$ is compact and the remainder in (4.5) is of the stated order for each $z \in R$, the remainder is of that order uniformly in $z$.

Writing $\tau = (\mu/\nu)f + g$ and $\kappa(u, z) = \int_{v : \|v\| \le \|u\|} \tau(z+v)\,dv$, we see that the density of $Z_{(i)} - z$ at $u$ is

$$f_i(u, z) = \nu\,\tau(z+u)\,\frac{\{\nu\kappa(u,z)\}^{i-1}}{(i-1)!}\,e^{-\nu\kappa(u,z)}.$$

Therefore,

$$\sum_{i=1}^k E(Z_{(i)} - z) = \nu \int u\,\tau(z+u)\,P\{W(u,z) \le k-1\}\,du, \qquad (4.6)$$
$$\sum_{i=1}^k E\{(Z_{(i)} - z)(Z_{(i)} - z)^T\} = \nu \int u u^T\,\tau(z+u)\,P\{W(u,z) \le k-1\}\,du, \qquad (4.7)$$

where the random variable $W(u,z)$ is Poisson-distributed with mean $\nu\kappa(u,z)$, and the integrals are over $\mathbb{R}^d$. In (4.6) and (4.7) we shall make the change of variable

$$u = \left\{ \frac{k}{\nu a_d \tau(z)} \right\}^{1/d} v. \qquad (4.8)$$

If $\varepsilon_1 > 0$ is chosen so small that $\nu^{-3\varepsilon_1}\{\nu/k_2(\nu)\}^{2/d} \to \infty$, then, with $v$ defined by (4.8), and $t_j = \nu^{-j\varepsilon_1}\{\nu/k_2(\nu)\}^{2/d}$, we have, for all sufficiently large $\nu$, and for all $\|v\| > t_1$ and all $k \in [k_1(\nu), k_2(\nu)]$,

$$\frac{k - 1 - E\{W(u,z)\}}{\{EW(u,z)\}^{1/2}} \le \frac{k - 1 - kt_2}{(kt_2)^{1/2}} \le -\frac12 (kt_2)^{1/2} = -\frac12 (k\nu^{\varepsilon_1} t_3)^{1/2}.$$

It follows that, for all sufficiently large $\nu$, and for all $\|v\| \ge t_1$, all $k \in [k_1(\nu), k_2(\nu)]$ and each $C > 0$,

$$P\{W(u,z) \le k-1\} \le P\left[ -\frac{W(u,z) - EW(u,z)}{\{EW(u,z)\}^{1/2}} \ge \frac12 (k\nu^{\varepsilon_1} t_3)^{1/2} \right] \le \left(\frac{4}{\nu^{\varepsilon_1}}\right)^C E\left\{ \left| \frac{W(u,z) - EW(u,z)}{\{EW(u,z)\}^{1/2}} \right|^{2C} \right\} \le C_1 \nu^{-C\varepsilon_1}, \qquad (4.9)$$

where $C_1 > 0$ depends only on $C$. Here we have used the fact that $W(u,z)$ is Poisson-distributed with a mean that is bounded below by 1 for $\|v\| \ge t_1$ and large $\nu$.

Combining (4.5), (4.6), (4.7) and (4.9), and noting that the distribution of $W(u,z)$ is symmetric in $u$, we deduce that

$$\sum_{i=1}^k \{E\psi(Z_{(i)}) - \psi(z)\} = \nu \int_{u : \|v\| \le t_1} \dot\psi(z)^T u\,\{\tau(z+u) - \tau(z)\}\,P\{W(u,z) \le k-1\}\,du + \frac{\nu}{2} \int_{u : \|v\| \le t_1} u^T \ddot\psi(z) u\,\tau(z+u)\,P\{W(u,z) \le k-1\}\,du + o\{k(k/\nu)^{2/d}\}, \qquad (4.10)$$
uniformly in $z \in R$. Writing $\tau_j$ for the first derivative of $\tau$ with respect to $z^{(j)}$, defining $\dot\psi$ and $\dot\tau = (\tau_1, \ldots, \tau_d)^T$ analogously to $\dot\rho$, defining $\ddot\psi = (\psi_{ij})$, a $d \times d$ matrix, and taking $\mathcal{T}$ to be the set of $v$ such that $\|v\| \le t_1$, and $\mathcal{T}'$ to be the corresponding set of $u$, given by (4.8), we deduce from (4.10) that

$$\sum_{i=1}^k \{E\psi(Z_{(i)}) - \psi(z)\} = \nu \int_{\mathcal{T}'} \{\dot\psi(z)^T u u^T \dot\tau(z) + \tfrac12 u^T \ddot\psi(z) u\,\tau(z)\}\,P\{W(u,z) \le k-1\}\,du + o\{k(k/\nu)^{2/d}\}$$
$$= \nu \int_{\mathcal{T}'} \left[ \sum_{j=1}^d (u^{(j)})^2 \{\psi_j(z)\tau_j(z) + \tfrac12 \psi_{jj}(z)\tau(z)\} \right] P\{W(u,z) \le k-1\}\,du + o\{k(k/\nu)^{2/d}\} \qquad (4.11)$$
$$= \{k/a_d\tau(z)\}\{k/\nu a_d\tau(z)\}^{2/d} \int_{\mathcal{T}} \left[ \sum_{j=1}^d (v^{(j)})^2 \{\psi_j(z)\tau_j(z) + \tfrac12 \psi_{jj}(z)\tau(z)\} \right] P\{W(u,z) \le k-1\}\,dv + o\{k(k/\nu)^{2/d}\},$$

uniformly in $z \in R$.

To control the value of $P\{W(u,z) \le k-1\}$ in (4.11), we shall use a normal approximation to the distribution of a Poisson random variable with large mean, and a crude bound to that distribution when the mean is small. Specifically, let $Z_\zeta$ have a Poisson distribution with mean $\zeta$. Then, for each $C > 0$, there exists a constant $C_1 = C_1(C) > 0$ such that, whenever $\zeta \ge 0$:

(a) for $\zeta \ge 1$, $\displaystyle \sup_{-\infty < x < \infty} (1 + |x|)^C \left| P\left\{ \frac{Z_\zeta - \zeta}{\zeta^{1/2}} \le x \right\} - \Phi(x) \right| \le C_1 \zeta^{-1/2}$; $\qquad (4.12)$
(b) for $\zeta \le 1$ and $x > 0$, $(1 + |x|)^C\,P(Z_\zeta > x) \le C_1 \zeta$.

Since $\kappa(u,z) = a_d \tau(z) \|u\|^d \{1 + O(\|u\|^2)\}$ as $\|u\| \to 0$, uniformly in $z \in R$, then if $u$ is given by (4.8), $\nu\kappa(u,z) = k\|v\|^d [1 + O\{(k/\nu)^{2/d}\|v\|^2\}]$, uniformly in $z \in R$. It follows that

$$\frac{k - 1 - EW(u,z)}{\{EW(u,z)\}^{1/2}} = k^{1/2}\,\frac{1 - \|v\|^d}{\|v\|^{d/2}}\,[1 + O\{(k/\nu)^{2/d}\|v\|^2\}] + O\{k^{1/2}(k/\nu)^{2/d}\|v\|^{2+(d/2)} + k^{-1/2}\|v\|^{-d/2}\}.$$

Noting that

$$P\{W(u,z) \le k-1\} = P\left[ \frac{W(u,z) - EW(u,z)}{\{\mathrm{var}\,W(u,z)\}^{1/2}} \le \frac{k - 1 - EW(u,z)}{\{EW(u,z)\}^{1/2}} \right], \qquad (4.13)$$

using (4.12)(a) to produce an approximation to the right-hand side of (4.13) when $k^{-1/d} \le \|v\| \le t_1$, and using (4.12)(b) for the same purpose when $\|v\| \le k^{-1/d}$, we deduce from (4.11) that

$$\sum_{i=1}^k \{E\psi(Z_{(i)}) - \psi(z)\} = k(k/\nu)^{2/d}\,\alpha_1(z) + o\{k(k/\nu)^{2/d}\}, \qquad (4.14)$$

uniformly in $z \in R$, where

$$\alpha_1(z) \equiv \{a_d\tau(z)\}^{-1-(2/d)} \int_{\|v\| \le 1} \left[ \sum_{j=1}^d (v^{(j)})^2 \{\psi_j(z)\tau_j(z) + \tfrac12 \psi_{jj}(z)\tau(z)\} \right] dv = \{a_d\tau(z)\}^{-1-(2/d)}\,d^{-1} \left[ \sum_{j=1}^d \{\psi_j(z)\tau_j(z) + \tfrac12 \psi_{jj}(z)\tau(z)\} \right] \int_{\|v\| \le 1} \left\{ \sum_{j=1}^d (v^{(j)})^2 \right\} dv, \qquad (4.15)$$

the latter being identical to $\alpha(z)$, defined at (2.6), except that there, $\psi$ and $\tau$ are replaced by their respective limits, $\rho$ and $\lambda$.

In our proofs throughout Section 4, it is convenient to work not with $S$ but with the locus $S_\nu$ of points $z_0$ such that

$$\frac{\mu f(z_0)}{\mu f(z_0) + \nu g(z_0)} = \frac12. \qquad (4.16)$$

[In this notation $S_\infty = \lim_{\nu\to\infty} S_\nu$ is the set of $z_0$ such that $\rho(z_0) = \frac12$.] We shall suppress the subscript on $S_\nu$, however, instead showing at the end of the proof [see the argument below (4.23)] that the transition from $S = S_\nu$ to $S_\infty$ is elementary.

We wish to develop an approximation to

$$K_\nu(g) \equiv \int_{S_\varepsilon} g(z)\,P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X)\,dz = \int_{S_\varepsilon} g(z)\,E\{\Pi^{\mathrm{Pois}}(\vec Z, k)\}\,dz. \qquad (4.17)$$
If we reinterpret $J_1, \ldots, J_k$ as random variables with distributions depending on $\vec Z$, independent conditional on $\vec Z$, and satisfying $P(J_i = 1 \mid \vec Z) = \psi(Z_{(i)})$, then

$$E\{\Pi^{\mathrm{Pois}}(\vec Z, k)\} = P\left( \sum_{i=1}^k J_i \ge \tfrac12 k \right). \qquad (4.18)$$

Let $T_{z_0}$ be the infinite line perpendicular to $S$ at $z_0$, and let $u$ denote a point on $T_{z_0}$. Now, $T_{z_0}$ has two "halves," one in the direction where $\psi(u)$ immediately increases above $\frac12$ as $u$ is moved away from $z_0$, and the other where $\psi(u)$ immediately decreases below $\frac12$. Call these $T_{z_0+}$ and $T_{z_0-}$, respectively. Note that $T_{z_0} = \{z_0 + t\dot\psi(z_0) : -\infty < t < \infty\}$ and $T_{z_0+} = \{z_0 + t\dot\psi(z_0) : 0 < t < \infty\}$.

Put $\mu_k(z) = \sum_{i \le k} E(J_i)$, $\sigma_k(z)^2 = \mathrm{var}(\sum_{i \le k} J_i)$, $W_k(z) = \{\sum_{i \le k} J_i - \mu_k(z)\}/\sigma_k(z)$ and $\chi(z) = I\{\psi(z) \le \frac12\}$. Assume $\varepsilon \downarrow 0$ and $k_1(\nu)^{1/2}\varepsilon \to \infty$ as $\nu \to \infty$. Then,

$$\sigma_k(z)^2 = \tfrac14 k\,\{1 + o(1)\}, \text{ uniformly in } z \in S_\varepsilon, \qquad (4.19)$$

as $\nu \to \infty$. By (4.17) and (4.18),

$$K'_\nu(g) \equiv K_\nu(g) - \int_{S_\varepsilon} g(z)\,(1 - \chi)(z)\,dz = \int_{S_\varepsilon} g(z)\,[P\{W_k(z) > w_k(z)\} - (1 - \chi)(z)]\,dz,$$
$$K'_\nu(f) \equiv K_\nu(f) - \int_{S_\varepsilon} f(z)\,\chi(z)\,dz = \int_{S_\varepsilon} f(z)\,[P\{W_k(z) \le w_k(z)\} - \chi(z)]\,dz,$$

where $K_\nu(f)$ denotes the analogue of (4.17) with $f$ in place of $g$ and "type $Y$" in place of "type $X$," and

$$w_k(z) = -\frac{1}{\sigma_k(z)} \sum_{i=1}^k \left\{ E\psi(Z_{(i)}) - \frac12 \right\}.$$

Hence,

$$K''_\nu \equiv \frac{\mu}{\mu+\nu} K'_\nu(f) + \frac{\nu}{\mu+\nu} K'_\nu(g) = \int_{S_\varepsilon} \left\{ \frac{\mu f(z) - \nu g(z)}{\mu+\nu} \right\} [P\{W_k(z) \le w_k(z)\} - \chi(z)]\,dz. \qquad (4.20)$$

In view of (4.19), a standard application of the nonuniform version of the Berry–Esseen theorem to the sum of independent random variables represented by $W_k(z)$ implies that, for each $C > 0$,

$$\sup_{z \in S_\varepsilon} \sup_{-\infty < w < \infty} (1 + |w|)^C\,|P\{W_k(z) \le w\} - \Phi(w)| = O(k^{-1/2}).$$

Hence, for each $C > 0$,

$$K''_\nu = \int_{S_\varepsilon} \left\{ \frac{\mu f(z) - \nu g(z)}{\mu+\nu} \right\} [\Phi\{w_k(z)\} - \chi(z)]\,dz + O\left[ k^{-1/2} \int_{S_\varepsilon} \left| \frac{\mu f(z) - \nu g(z)}{\mu+\nu} \right| \{1 + |w_k(z)|\}^{-C}\,dz \right]. \qquad (4.21)$$

Using (4.14), (4.15) and (4.19), it can be shown that, if we take $z = z_0 + k^{-1/2}u$, with $z_0 \in S$ and $u$ given by $z_0 + k^{-1/2}u \in T_{z_0}$, then

$$-w_k(z) = \{1 + o(1)\}\,2k^{-1/2}\,k\,[\psi(z) - \tfrac12 + (k/\nu)^{2/d}\alpha_1(z) + o\{(k/\nu)^{2/d}\}] = \{1 + o(1)\}\,2\left\{ \sum_{j=1}^d u^{(j)}\psi_j(z_0) + k^{1/2}(k/\nu)^{2/d}\alpha_1(z_0) \right\} + o\{\|u\| + k^{1/2}(k/\nu)^{2/d}\},$$

uniformly in $z \in S_\varepsilon$. Hence, writing $U_{z_0} = T_{z_0} - z_0$, $U_{z_0\pm} = T_{z_0\pm} - z_0$, and $c_k = k^{1/2}(k/\nu)^{2/d}$, we obtain from (4.21)

$$k K''_\nu = \int_S \int_{U_{z_0}} \{p\dot f(z_0) - (1-p)\dot g(z_0)\}^T u\,(\Phi[-2\{\dot\psi(z_0)^T u + c_k\alpha_1(z_0)\}] - I(u \in U_{z_0-}))\,du\,dz_0 + o(1 + c_k^2)$$
$$= \int_S \int_{-\infty}^{\infty} \{p\dot f(z_0) - (1-p)\dot g(z_0)\}^T \dot\psi(z_0)\,\|\dot\psi(z_0)\|^{-1}\,t\,(\Phi[-2\{\|\dot\psi(z_0)\|\,t + c_k\alpha_1(z_0)\}] - I(t < 0))\,dt\,dz_0 + o(1 + c_k^2) \qquad (4.22)$$
$$= C_1(S) + C_2(S)\,c_k^2 + o(1 + c_k^2),$$

where, to obtain the second identity, we take $u = t\dot\psi(z_0)/\|\dot\psi(z_0)\|$. In (4.22), $C_1(S)$ and $C_2(S)$ have the definitions at (2.7), except that here $S$ is interpreted as the set of points $z_0$ for which (4.16) holds. Combining (4.2) and (4.22), we deduce that

$$\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}} - \mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}} = C_1(S)\,k^{-1} + C_2(S)\,(k/\nu)^{4/d} + o\{k^{-1} + (k/\nu)^{4/d}\}. \qquad (4.23)$$
Under the conditions assumed in Theorem 1, $\mu/(\mu+\nu) \to p$ as $\nu \to \infty$, from which it follows that $C_1(S)$ and $C_2(S)$ converge to the values they would take if we were to define $S$ as the set of points $z_0$ for which, instead of (4.16), $pf(z_0)/\{pf(z_0) + (1-p)g(z_0)\} = \frac12$. This is the definition used for $S$ at (2.7). Note too that $\psi \to \rho$ and $\tau \to \lambda$ as $\nu \to \infty$, and that these limits arise in a very simple way. For example, $\tau = (\mu/\nu)f + g$ converges to $\lambda = p(1-p)^{-1}f + g$ since $\mu/\nu \to p(1-p)^{-1}$; the functions $f$ and $g$ remain fixed. Since $C_1(S)$ and $C_2(S)$ converge to their values at (2.7), Theorem 1 follows from (4.23).

Acknowledgments. We are grateful to the reviewers for helpful comments.

REFERENCES

Audibert, J.-Y. and Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers under the margin condition. Ann. Statist. 35 608–633. MR2336861
Bax, E. (2000). Validation of nearest neighbor classifiers. IEEE Trans. Inform. Theory 46 2746–2752. MR1807404
Cover, T. M. (1968). Rates of convergence for nearest neighbor procedures. In Proceedings of the Hawaii International Conference on System Sciences (B. K. Kinariwala and F. F. Kuo, eds.) 413–415. Univ. Hawaii Press, Honolulu.
Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13 21–27.
Devroye, L. (1981). On the asymptotic probability of error in nonparametric discrimination. Ann. Statist. 9 1320–1327. MR0630114
Devroye, L., Györfi, L., Krzyżak, A. and Lugosi, G. (1994). On the strong universal consistency of nearest neighbor regression function estimates. Ann. Statist. 22 1371–1385. MR1311980
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York. MR1383093
Devroye, L. and Wagner, T. J. (1977). The strong uniform consistency of nearest neighbor density estimates. Ann. Statist. 5 536–540. MR0436442
Devroye, L. and Wagner, T. J. (1982). Nearest neighbor methods in discrimination. In Classification, Pattern Recognition and Reduction of Dimensionality. Handbook of Statistics 2 (P. R. Krishnaiah and L. N. Kanal, eds.) 193–197. North-Holland, Amsterdam. MR0716706
Fix, E. and Hodges, J. L., Jr. (1951). Discriminatory analysis, nonparametric discrimination, consistency properties. Randolph Field, Texas, Project 21-49-004, Report No. 4.
Fritz, J. (1975). Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Trans. Inform. Theory 21 552–557. MR0395379
Györfi, L. (1978). On the rate of convergence of nearest neighbor rules. IEEE Trans. Inform. Theory 24 509–512. MR0501595
Györfi, L. (1981). The rate of convergence of k-NN regression estimates and classification rules. IEEE Trans. Inform. Theory 27 362–364. MR0619124
Györfi, L. and Györfi, Z. (1978). An upper bound on the asymptotic error probability of the k-nearest neighbor rule for multiple classes. IEEE Trans. Inform. Theory 24 512–514. MR0501596
Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York. MR1920390
Hall, P. and Kang, K.-H. (2005). Bandwidth choice for nonparametric classification. Ann. Statist. 33 284–306. MR2157804
Hall, P., Park, B. U. and Samworth, R. J. (2007). Choice of neighbour order for nearest-neighbour classification rule. Available at http://stat.snu.ac.kr/theostat/papers/hps.pdf.
Holst, M. and Irle, A. (2001). Nearest neighbor classification with dependent training sequences. Ann. Statist. 29 1424–1442. MR1873337
Kharin, Yu. S. (1982). Asymptotic expansions for the risk of parametric and nonparametric decision functions. In Transactions of the Ninth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes B 11–16. Reidel, Dordrecht. MR0757900
Kharin, Yu. S. and Ducinskas, K. (1979). The asymptotic expansion of the risk for classifiers using maximum likelihood estimates. Statist. Problemy Upravleniya—Trudy Sem. Protsessy Optimal. Upravleniya V Sektsiya 38 77–93. (In Russian.) MR0565564
Kohler, M. and Krzyżak, A. (2006). On the rate of convergence of local averaging plug-in classification rules under a margin condition. Manuscript.
Kulkarni, S. R. and Posner, S. E. (1995). Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Trans. Inform. Theory 41 1028–1039. MR1366756
Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829. MR1765618
Marron, J. S. (1983). Optimal rates of convergence to Bayes risk in nonparametric discrimination. Ann. Statist. 11 1142–1155. MR0720260
Psaltis, D., Snapp, R. R. and Venkatesh, S. S. (1994). On the finite sample performance of the nearest neighbor classifier. IEEE Trans. Inform. Theory 40 820–837.
Raudys, Š. and Young, D. (2004). Results in statistical discriminant analysis: A review of the former Soviet Union literature. J. Multivariate Anal. 89 1–35. MR2041207
Snapp, R. R. and Venkatesh, S. S. (1998). Asymptotic expansion of the k nearest neighbor risk. Ann. Statist. 26 850–878. MR1635410
Wagner, T. J. (1971). Convergence of the nearest neighbor rule. IEEE Trans. Inform. Theory 17 566–571. MR0298829

P. Hall
Department of Mathematics and Statistics
University of Melbourne
Parkville, VIC 3010
Australia

B. U. Park
Department of Statistics
Seoul National University
Seoul 151-747
Korea
E-mail: bupark2000@gmail.com

R. J. Samworth
Statistical Laboratory
Centre for Mathematical Sciences
University of Cambridge
Wilberforce Road
Cambridge CB3 0WB
United Kingdom
