Learning Low-Density Separators

We define a novel, basic, unsupervised learning problem: learning the lowest density homogeneous hyperplane separator of an unknown probability distribution. This task is relevant to several problems in machine learning, such as semi-supervised learning and clustering stability. We investigate the question of existence of a universally consistent algorithm for this problem. We propose two natural learning paradigms and prove that, on input unlabeled random samples generated by any member of a rich family of distributions, they are guaranteed to converge to the optimal separator for that distribution. We complement this result by showing that no learning algorithm for our task can achieve uniform learning rates (that are independent of the data generating distribution).

Authors: Shai Ben-David, Tyler Lu, Dávid Pál, Miroslava Sotáková

Learning Low-Density Separators

Shai Ben-David (1), Tyler Lu (1), Dávid Pál (1), and Miroslava Sotáková (2)

(1) David R. Cheriton School of Computer Science, University of Waterloo, ON, Canada. {shai,ttlu,dpal}@cs.uwaterloo.ca
(2) Department of Computer Science, University of Aarhus, Denmark. mirka@cs.au.dk

Abstract. We define a novel, basic, unsupervised learning problem: learning the lowest density homogeneous hyperplane separator of an unknown probability distribution. This task is relevant to several problems in machine learning, such as semi-supervised learning and clustering stability. We investigate the question of existence of a universally consistent algorithm for this problem. We propose two natural learning paradigms and prove that, on input unlabeled random samples generated by any member of a rich family of distributions, they are guaranteed to converge to the optimal separator for that distribution. We complement this result by showing that no learning algorithm for our task can achieve uniform learning rates (that are independent of the data generating distribution).

1 Introduction

While the theory of machine learning has achieved extensive understanding of many aspects of supervised learning, our theoretical understanding of unsupervised learning leaves a lot to be desired. In spite of the obvious practical importance of various unsupervised learning tasks, the state of our current knowledge does not provide anything that comes close to the rigorous mathematical performance guarantees that classification prediction theory enjoys. In this paper we make a small step in that direction by analyzing one specific unsupervised learning task: the detection of low-density linear separators for data distributions over Euclidean spaces.

We consider the following task: for an unknown data distribution over $\mathbb{R}^n$, find the homogeneous hyperplane of lowest density that cuts through that distribution. We assume that the underlying data distribution has a continuous density function and that the data available to the learner are finite i.i.d. samples of that distribution.

Our model can be viewed as a restricted instance of the fundamental issue of inferring information about a probability distribution from the random samples it generates. Tasks of that nature range from the ambitious problem of density estimation [8], through estimation of level sets [4], [13], [1], densest region detection [3], and, of course, clustering. All of these tasks are notoriously difficult with respect to both the sample complexity and the computational complexity aspects (unless one presumes strong restrictions on the nature of the underlying data distribution). Our task seems more modest than these. Although we are not aware of any previous work on this problem (from the point of view of statistical machine learning, at least), we believe that it is a rather basic problem that is relevant to various practical learning scenarios.

One important domain to which the detection of low-density linear data separators is relevant is semi-supervised learning [7]. Semi-supervised learning is motivated by the fact that in many real world classification problems, unlabeled samples are much cheaper and easier to obtain than labeled examples.
Consequently, there is great incentive to develop tools by which such unlabeled samples can be utilized to improve the quality of sample based classifiers. Naturally, the utility of unlabeled data to classification depends on assuming some relationship between the unlabeled data distribution and the class membership of data points (see [5] for a rigorous discussion of this point). A common postulate of that type is that the boundary between data classes passes through low-density regions of the data distribution. The Transductive Support Vector Machines (TSVM) paradigm [9] is an example of an algorithm that implicitly uses such a low-density boundary assumption. Roughly speaking, TSVM searches for a hyperplane that has small error on the labeled data and at the same time has wide margin with respect to the unlabeled data sample.

Another area in which low-density boundaries play a significant role is the analysis of clustering stability. Recent work on the analysis of clustering stability found a close relationship between the stability of a clustering and the data density along the cluster boundaries: roughly speaking, the lower these densities, the more stable the clustering ([6], [12]).

A low-density-cut algorithm for a family $F$ of probability distributions takes as input a finite sample generated by some distribution $f \in F$ and has to output a hyperplane through the origin with low density w.r.t. $f$. In particular, we consider the family of all distributions over $\mathbb{R}^n$ that have continuous density functions. We investigate two notions of success for low-density-cut algorithms: uniform convergence (over a family of probability distributions) and consistency. For uniform convergence we prove a general negative result, showing that no algorithm can guarantee any fixed convergence rates (in terms of sample sizes). This negative result holds even in the simplest case, where the data domain is the one-dimensional unit interval. For consistency (i.e., allowing the learning/convergence rates to depend on the data-generating distribution), we prove the success of two natural algorithmic paradigms: soft-margin algorithms, which choose a margin parameter (depending on the sample size) and output the separator with lowest empirical weight in the margins around it, and hard-margin algorithms, which choose the separator with widest sample-free margins.

The paper is organized as follows: Section 2 provides the formal definition of our learning task as well as the success criteria that we investigate. In Section 3 we present two natural learning paradigms for the problem over the real line and prove their universal consistency over a rich class of probability distributions. Section 4 extends these results to show the learnability of lowest-density homogeneous linear cuts for probability distributions over $\mathbb{R}^d$ for arbitrary dimension $d$. In Section 5 we show that the previous universal consistency results cannot be improved to obtain uniform learning rates (by any finite-sample based algorithm). We conclude the paper with a discussion of directions for further research.

2 Preliminaries

We consider probability distributions over $\mathbb{R}^d$. For concreteness, let the domain of the distribution be the $d$-dimensional unit ball.
A linear cut learning algorithm is an algorithm that takes as input a finite set of domain points, a sample $S \subseteq \mathbb{R}^d$, and outputs a homogeneous hyperplane, $L(S)$ (determined by a weight vector $\mathbf{w} \in \mathbb{R}^d$ with $\|\mathbf{w}\|_2 = 1$). We investigate algorithms that aim to detect hyperplanes with low density with respect to the sample-generating probability distribution.

Let $f : \mathbb{R}^d \to \mathbb{R}^+_0$ be a $d$-dimensional density function. We assume that $f$ is continuous. For any homogeneous hyperplane $h(\mathbf{w}) = \{x \in \mathbb{R}^d : \mathbf{w}^T x = 0\}$ defined by a unit weight vector $\mathbf{w} \in \mathbb{R}^d$, we consider the $(d-1)$-dimensional integral of the density over $h$,

$$f(\mathbf{w}) := \int_{h(\mathbf{w})} f(x)\, dx.$$

Note that $\mathbf{w} \mapsto f(\mathbf{w})$ is a continuous mapping defined on the $(d-1)$-sphere $S^{d-1} = \{\mathbf{w} \in \mathbb{R}^d : \|\mathbf{w}\|_2 = 1\}$, and that, for any such weight vector, $f(\mathbf{w}) = f(-\mathbf{w})$. In the one-dimensional case, these hyperplanes are replaced by points $x$ on the real line, and $f(x)$ is just the density at the point $x$.

Definition 1. A linear cut learning algorithm is a function that maps samples to homogeneous hyperplanes. Namely,

$$L : \bigcup_{m=1}^{\infty} (\mathbb{R}^d)^m \to S^{d-1}.$$

When $d = 1$, we require that $L : \bigcup_{m=1}^{\infty} \mathbb{R}^m \to [0,1]$. (The intention is that $L$ finds the lowest density linear separator of the sample generating distribution.)

Definition 2. Let $\mu$ be a probability distribution and $f$ its density function. For a weight vector $\mathbf{w}$ we define the half-spaces $h^+(\mathbf{w}) = \{x \in \mathbb{R}^d : \mathbf{w}^T x \geq 0\}$ and $h^-(\mathbf{w}) = \{x \in \mathbb{R}^d : \mathbf{w}^T x \leq 0\}$. For any weight vectors $\mathbf{w}$ and $\mathbf{w}'$,

1. $D_E(\mathbf{w}, \mathbf{w}') = 1 - |\mathbf{w}^T \mathbf{w}'|$
2. $D_\mu(\mathbf{w}, \mathbf{w}') = \min\{\mu(h^+(\mathbf{w}) \,\Delta\, h^+(\mathbf{w}')),\ \mu(h^-(\mathbf{w}) \,\Delta\, h^+(\mathbf{w}'))\}$
3. $D_f(\mathbf{w}, \mathbf{w}') = |f(\mathbf{w}') - f(\mathbf{w})|$

We shall mostly consider the distance measure $D_E$ in $\mathbb{R}^d$ for $d > 1$, and $D_E(x, y) = |x - y|$ for $x, y \in \mathbb{R}$. In these cases we omit any explicit reference to $D$. All of our results hold as well when $D$ is taken to be the probability mass of the symmetric difference between $L(S)$ and $\mathbf{w}^*$, and when $D$ is taken to be $D(\mathbf{w}, \mathbf{w}') = |f(\mathbf{w}) - f(\mathbf{w}')|$.

Definition 3. Let $F$ denote a family of probability distributions over $\mathbb{R}^d$. We assume that all members of $F$ have density functions, and identify a distribution with its density function. Let $D$ denote a distance function over hyperplanes. For a linear cut learning algorithm $L$, as above,

1. We say that $L$ is consistent for $F$ w.r.t. a distance measure $D$ if, for any probability distribution $f$ in $F$, if $f$ attains a unique minimum density hyperplane $\mathbf{w}^*$, then for all $\epsilon > 0$

$$\lim_{m \to \infty} \Pr_{S \sim f^m}\left[D(L(S), \mathbf{w}^*) \geq \epsilon\right] = 0. \tag{1}$$

2. We say that $L$ is uniformly convergent for $F$ (w.r.t. a distance measure $D$) if, for every $\epsilon, \delta > 0$, there exists an $m(\epsilon, \delta)$ such that for any probability distribution $f \in F$, if $f$ has a unique minimizer $\mathbf{w}^*$, then for all $m \geq m(\epsilon, \delta)$ we have

$$\Pr_{S \sim f^m}\left[D(L(S), \mathbf{w}^*) \geq \epsilon\right] \leq \delta. \tag{2}$$
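To make the quantity $f(\mathbf{w})$ concrete: for a unit vector $\mathbf{w}$, the $(d-1)$-dimensional integral of the density over $h(\mathbf{w})$ equals the density of the one-dimensional projection $\mathbf{w}^T X$ at $0$. The snippet below is our own illustration, not from the paper: for a centered Gaussian with covariance $\Sigma$ the projection density at $0$ is $1/\sqrt{2\pi\,\mathbf{w}^T \Sigma \mathbf{w}}$, so the lowest-density homogeneous cut is the one normal to the largest-variance direction. The choices of $\Sigma$ and $\mathbf{w}$ are arbitrary.

```python
import numpy as np

# Check numerically that f(w), the density integrated over the hyperplane
# {x : w^T x = 0}, matches the closed form 1/sqrt(2*pi*w^T Sigma w) for a
# centered Gaussian. Sigma and w are illustrative choices, not from the paper.
rng = np.random.default_rng(0)
Sigma = np.diag([4.0, 1.0])      # variance 4 along x1, variance 1 along x2
w = np.array([1.0, 0.0])         # candidate unit normal vector

closed_form = 1 / np.sqrt(2 * np.pi * w @ Sigma @ w)

# Monte Carlo estimate: empirical density of the projection w^T X near 0.
X = rng.multivariate_normal(np.zeros(2), Sigma, size=1_000_000)
t = X @ w
eps = 0.01
estimate = np.mean(np.abs(t) <= eps) / (2 * eps)
print(closed_form, estimate)     # both approximately 0.1995
```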
3 The One-Dimensional Problem

Let $F_1$ be the family of all probability distributions over the unit interval $[0,1]$ that have a continuous density function. We consider two natural algorithms for the lowest density cut over this family. The first is a simple bucketing algorithm; we explain it in detail and show its consistency in Section 3.1. The second is the hard-margin algorithm, which outputs the mid-point of the largest gap between two consecutive points of the sample; we show its consistency in Section 3.2. In Section 5 we show that there are no algorithms that are uniformly convergent for $F_1$.

3.1 The Bucketing Algorithm

The algorithm is parameterized by a function $k : \mathbb{N} \to \mathbb{N}$. For a sample of size $m$, the algorithm splits the interval $[0,1]$ into $k(m)$ equal-length subintervals (buckets). Given an input sample $S$, it counts the number of sample points lying in each bucket and outputs the mid-point of the bucket with fewest sample points. In case of ties, it picks the rightmost such bucket. We denote this algorithm by $B_k$. As it turns out, there exists a choice of $k(m)$ which makes the algorithm $B_k$ consistent for $F_1$.

Theorem 1. If the number of buckets satisfies $k(m) = o(\sqrt{m})$ and $k(m) \to \infty$ as $m \to \infty$, then the bucketing algorithm $B_k$ is consistent for $F_1$.

Proof. Fix $f \in F_1$ and assume $f$ has a unique minimizer $x^*$. Fix $\epsilon, \delta > 0$. Let $U = (x^* - \epsilon/2,\ x^* + \epsilon/2)$ be a neighbourhood of the unique minimizer $x^*$. The set $[0,1] \setminus U$ is compact and hence $\alpha := \min f([0,1] \setminus U)$ exists. Since $x^*$ is the unique minimizer of $f$, $\alpha > f(x^*)$ and hence $\eta := \alpha - f(x^*)$ is positive. Thus we can pick a neighbourhood $V$ of $x^*$, $V \subset U$, such that for all $x \in V$, $f(x) < \alpha - \eta/2$.

The assumptions on the growth of $k(m)$ imply that there exists $m_0$ such that for all $m \geq m_0$,

$$\frac{1}{k(m)} < \frac{|V|}{2} \tag{3}$$

$$2\sqrt{\frac{\ln(1/\delta)}{m}} < \frac{\eta}{2k(m)} \tag{4}$$

Fix any $m \geq m_0$ and divide $[0,1]$ into $k(m)$ buckets, each of length $1/k(m)$. For any bucket $I$ with $I \cap U = \emptyset$,

$$\mu(I) \geq \frac{\alpha}{k(m)}. \tag{5}$$

Since $1/k(m) < |V|/2$, there exists a bucket $J$ such that $J \subseteq V$. Furthermore,

$$\mu(J) \leq \frac{\alpha - \eta/2}{k(m)}. \tag{6}$$

For a bucket $I$, we denote by $|I \cap S|$ the number of sample points in the bucket $I$. From the well known Vapnik-Chervonenkis bounds [2], with probability at least $1 - \delta$ over i.i.d. draws of a sample $S$ of size $m$, for every bucket $I$,

$$\left|\frac{|I \cap S|}{m} - \mu(I)\right| \leq \sqrt{\frac{\ln(1/\delta)}{m}}. \tag{7}$$

Fix any sample $S$ satisfying inequality (7). For any bucket $I$ with $I \cap U = \emptyset$,

$$\frac{|J \cap S|}{m} \leq \mu(J) + \sqrt{\frac{\ln(1/\delta)}{m}} \qquad \text{by (7)}$$
$$\leq \frac{\alpha - \eta/2}{k(m)} + \sqrt{\frac{\ln(1/\delta)}{m}} \qquad \text{by (6)}$$
$$< \frac{\alpha}{k(m)} - 2\sqrt{\frac{\ln(1/\delta)}{m}} + \sqrt{\frac{\ln(1/\delta)}{m}} \qquad \text{by (4)}$$
$$\leq \mu(I) - \sqrt{\frac{\ln(1/\delta)}{m}} \qquad \text{by (5)}$$
$$\leq \frac{|I \cap S|}{m} \qquad \text{by (7)}$$

Since $|J \cap S| < |I \cap S|$, the algorithm $B_k$ cannot output the mid-point of any bucket $I$ for which $I \cap U = \emptyset$. Hence the algorithm's output, $B_k(S)$, is the mid-point of a bucket which intersects $U$. Thus the estimate $B_k(S)$ differs from $x^*$ by at most the sum of the radius of the neighbourhood $U$ and the radius of that bucket. Since the length of a bucket is $1/k(m) < |V|/2$ and $V \subset U$, the sum of the radii is at most $|U|/2 + |V|/4 < \frac{3}{4}|U| < \epsilon$.

Combining all the above, we have that for any $\epsilon, \delta > 0$ there exists $m_0$ such that for any $m \geq m_0$, with probability at least $1 - \delta$ over the draw of an i.i.d. sample $S$ of size $m$, $|B_k(S) - x^*| < \epsilon$. This is the same as saying that $B_k$ is consistent for $f$. ⊓⊔
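For concreteness, here is a minimal sketch of the bucketing algorithm $B_k$ (our illustration, not the paper's code). The cube-root schedule for $k(m)$ is one arbitrary choice satisfying the hypotheses of Theorem 1, and the Beta(0.5, 0.5) test density, U-shaped with unique minimum at $1/2$, is a hypothetical demo distribution only.

```python
import numpy as np

def bucketing_algorithm(sample, k):
    """B_k: split [0, 1] into k equal buckets, count sample points in each,
    and return the mid-point of the bucket with fewest points (rightmost
    bucket wins ties, as in the paper)."""
    sample = np.asarray(sample)
    idx = np.minimum((sample * k).astype(int), k - 1)  # bucket index per point
    counts = np.bincount(idx, minlength=k)
    emptiest = k - 1 - int(np.argmin(counts[::-1]))    # rightmost minimum
    return (emptiest + 0.5) / k

def k_of_m(m):
    # One admissible schedule: k(m) -> infinity and k(m) = o(sqrt(m)).
    return max(1, int(round(m ** (1 / 3))))

rng = np.random.default_rng(0)
m = 200_000
sample = rng.beta(0.5, 0.5, size=m)   # U-shaped density, unique minimum at 0.5
print(bucketing_algorithm(sample, k_of_m(m)))   # expected: close to 0.5
```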
Note that in the above proof we cannot replace the condition $k(m) = o(\sqrt{m})$ with $k(m) = O(\sqrt{m})$, since the Vapnik-Chervonenkis bounds do not allow us to detect an $O(1/\sqrt{m})$ difference between the probability masses of two buckets. The following theorem shows that if there are too many buckets, the bucketing algorithm is no longer consistent.

Theorem 2. If the number of buckets satisfies $k(m) = \omega(m/\log m)$, then $B_k$ is not consistent for $F_1$.

To prove the theorem we need the following lemma dealing with the classical coupon collector problem.

Lemma 1 (The Coupon Collector Problem [11]). Let the random variable $X$ denote the number of trials needed to collect each of the $n$ types of coupons. Then for any constant $c \in \mathbb{R}$ and $m = n \ln n + cn$,

$$\lim_{n \to \infty} \Pr[X > m] = 1 - e^{-e^{-c}}.$$

Proof (of Theorem 2). Consider the following density $f$ on $[0,1]$,

$$f(x) = \begin{cases} (4 - 16x)/3 & \text{if } x \in [0, \tfrac{1}{4}] \\ (16x - 4)/3 & \text{if } x \in (\tfrac{1}{4}, \tfrac{1}{2}) \\ 4/3 & \text{if } x \in [\tfrac{1}{2}, 1] \end{cases}$$

which attains its unique minimum at $x^* = 1/4$. By the assumption on the growth of $k(m)$, for all sufficiently large $m$ we have $k(m) > 4$ and $k(m) > 8m/\ln m$. Consider all the buckets lying in the interval $[\tfrac{1}{2}, 1]$ and denote them by $b_1, b_2, \ldots, b_n$. Since the bucket size is less than $1/4$, these buckets cover the interval $[\tfrac{3}{4}, 1]$. Hence their total length is at least $1/4$, and so there are $n \geq k(m)/4 > 2m/\ln m$ such buckets.

We will show that for $m$ large enough, with probability at least $1/2$, at least one of the buckets $b_1, b_2, \ldots, b_n$ receives no sample point. Since the probability masses of $b_1, b_2, \ldots, b_n$ are all the same, we can think of these buckets as coupon types we are collecting and of the sample points as coupons. By Lemma 1, it suffices to verify that the number of trials, $m$, is at most $\frac{1}{2} n \ln n$. Indeed, we have

$$\frac{1}{2} n \ln n \geq \frac{1}{2}\,\frac{2m}{\ln m}\,\ln\!\left(\frac{2m}{\ln m}\right) = \frac{m}{\ln m}\left(\ln m + \ln 2 - \ln\ln m\right) \geq m,$$

where the last inequality holds for all large enough $m$. Now Lemma 1 implies that for sufficiently large $m$, with probability at least $1/2$, at least one of the buckets $b_1, b_2, \ldots, b_n$ contains no sample point. If there are empty buckets in $[\tfrac{1}{2}, 1]$, the algorithm outputs a point in $[\tfrac{1}{2}, 1]$. Since this happens with probability at least $1/2$ and since $x^* = 1/4$, the algorithm cannot be consistent. ⊓⊔

When the number of buckets $k(m)$ is asymptotically somewhere in between $\sqrt{m}$ and $m/\ln m$, the bucketing algorithm switches from being consistent to failing consistency. It remains an open question to determine where exactly the transition occurs.

3.2 The Hard-Margin Algorithm

Let the hard-margin algorithm be the function that outputs the mid-point of the largest interval between adjacent sample points. More formally, given a sample $S$ of size $m$, the algorithm sorts the sample $S \cup \{0, 1\}$ so that $x_0 = 0 \leq x_1 \leq x_2 \leq \cdots \leq x_m \leq 1 = x_{m+1}$ and outputs the midpoint $(x_i + x_{i+1})/2$, where the index $i$, $0 \leq i \leq m$, is such that the gap $[x_i, x_{i+1}]$ is the largest. Henceforth, the notion largest gap refers to the length of the largest interval between adjacent points of a sample.

Theorem 3. The hard-margin algorithm is consistent for the family $F_1$.
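Before turning to the proof, here is a minimal sketch of the hard-margin algorithm (our illustration; the Beta(0.5, 0.5) test density is the same hypothetical demo choice as above).

```python
import numpy as np

def hard_margin_algorithm(sample):
    """Sort the sample together with the endpoints 0 and 1 and return the
    mid-point of the largest gap between adjacent points."""
    pts = np.sort(np.concatenate(([0.0, 1.0], np.asarray(sample))))
    gaps = np.diff(pts)
    i = int(np.argmax(gaps))   # index of the largest gap
    return (pts[i] + pts[i + 1]) / 2.0

rng = np.random.default_rng(1)
sample = rng.beta(0.5, 0.5, size=100_000)
print(hard_margin_algorithm(sample))   # expected: close to the minimizer 0.5
```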
To prove the theorem we need the following property of the distribution of the largest gap between two adjacent elements of $m$ points forming an i.i.d. sample from the uniform distribution on $[0,1]$. The statement, for which we present a (to our knowledge) new proof, was originally proven by Lévy [10].

Lemma 2. Let $L_m$ be the random variable denoting the largest gap between adjacent points of an i.i.d. sample of size $m$ from the uniform distribution on $[0,1]$. For any $\epsilon > 0$,

$$\lim_{m \to \infty} \Pr\left[L_m \in \left((1-\epsilon)\frac{\ln m}{m},\ (1+\epsilon)\frac{\ln m}{m}\right)\right] = 1.$$
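The $\ln m / m$ scaling is easy to check empirically before reading the proof. The snippet below (ours, for intuition only) estimates $L_m \cdot m / \ln m$ for uniform samples; the ratio should hover near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for m in (1_000, 10_000, 100_000):
    # Average largest gap between adjacent order statistics over 50 trials.
    trials = [np.diff(np.sort(rng.random(m))).max() for _ in range(50)]
    print(m, np.mean(trials) * m / np.log(m))   # should be close to 1
```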
Proof (of Lemma 2). Consider the uniform distribution over the unit circle, and suppose we draw an i.i.d. sample of size $m$ from this distribution. Let $K_m$ denote the size of the largest gap between two adjacent sample points. It is not hard to see that the distribution of $K_m$ is the same as that of $L_{m-1}$. Furthermore, since $\frac{\ln(m)/m}{\ln(m+1)/(m+1)} \to 1$, we can prove the lemma with $L_m$ replaced by $K_m$.

Fix $\epsilon > 0$. First, let us show that for $m$ sufficiently large, $K_m$ is with probability $1 - o(1)$ above the lower bound $(1-\epsilon)\frac{\ln m}{m}$. We split the unit circle into $b = \frac{m}{(1-\epsilon)\ln m}$ buckets, each of length $(1-\epsilon)\frac{\ln m}{m}$. It follows from Lemma 1 that for any constant $\zeta > 0$, an i.i.d. sample of $(1-\zeta)\, b \ln b$ points leaves at least one bucket empty with probability $1 - o(1)$. We show that for some $\zeta$, $m \leq (1-\zeta)\, b \ln b$. Writing $1 + \delta := 1/(1-\epsilon)$, so that $b = (1+\delta)\frac{m}{\ln m}$, the right-hand side can be rewritten as

$$(1-\zeta)\, b \ln b = (1-\zeta)(1+\delta)\frac{m}{\ln m}\,\ln\!\left((1+\delta)\frac{m}{\ln m}\right) \geq m\,(1-\zeta)(1+\delta)\left(1 - O\!\left(\frac{\ln\ln m}{\ln m}\right)\right).$$

For $\zeta$ sufficiently small and $m$ sufficiently large, the last expression is greater than $m$, yielding that a sample of $m$ points misses at least one bucket with probability $1 - o(1)$. Therefore the largest gap $K_m$ is, with probability $1 - o(1)$, at least $(1-\epsilon)\frac{\ln m}{m}$.

Next we show that for $m$ sufficiently large, $K_m$ is with probability $1 - o(1)$ below the upper bound $(1+\epsilon)\frac{\ln m}{m}$. We consider $3/\epsilon$ bucketings $B_1, B_2, \ldots, B_{3/\epsilon}$. Each bucketing $B_i$, $i \in \{1, 2, \ldots, 3/\epsilon\}$, is a division of the unit circle into $b = \frac{m}{(1+\epsilon/3)\ln m}$ equal-length buckets; each bucket has length $\ell = (1+\epsilon/3)\frac{\ln m}{m}$. The bucketing $B_i$ has the left end-point of its first bucket at position $i(\ell\epsilon/3)$. The position of the left end-point of the first bucket of a bucketing is called the offset of the bucketing.

We first show that there exists $\zeta > 0$ such that $m \geq (1+\zeta)\, b \ln b$ for all sufficiently large $m$. Indeed,

$$(1+\zeta)\, b \ln b = (1+\zeta)\frac{m}{(1+\epsilon/3)\ln m}\,\ln\!\left(\frac{m}{(1+\epsilon/3)\ln m}\right) \leq \frac{1+\zeta}{1+\epsilon/3}\, m\left(1 - O\!\left(\frac{\ln\ln m}{\ln m}\right)\right).$$

For any $\zeta < \epsilon/3$ and sufficiently large $m$, the last expression is smaller than $m$. The existence of such a $\zeta$, together with Lemma 1, guarantees that for all sufficiently large $m$, for each bucketing $B_i$, with probability $1 - o(1)$, every bucket is hit by a sample point. We now apply the union bound and get that, for all sufficiently large $m$, with probability $1 - (3/\epsilon)\, o(1) = 1 - o(1)$, for each bucketing $B_i$ every bucket is hit by at least one sample point.

Consider any sample $S$ such that for each bucketing, each bucket is hit by at least one point of $S$. Then the largest gap in $S$ cannot be bigger than the bucket size plus the difference of offsets between two adjacent bucketings, since otherwise the largest gap would demonstrate an empty bucket in at least one of the bucketings. In other words, the largest gap $K_m$ is at most

$$K_m \leq (\ell\epsilon/3) + \ell = (1 + \epsilon/3)\,\ell = (1+\epsilon/3)^2\,\frac{\ln m}{m} < (1+\epsilon)\frac{\ln m}{m}$$

for any $\epsilon < 1$. ⊓⊔

Proof (of Theorem 3). Consider any two disjoint intervals $U, V \subseteq [0,1]$ such that for any $x \in U$ and any $y \in V$, $\frac{f(x)}{f(y)} < p < 1$ for some $p \in (0,1)$. We claim that with probability $1 - o(1)$, the largest gap in $U$ is bigger than the largest gap in $V$. If we draw an i.i.d. sample of $m$ points from $\mu$, then by the law of large numbers, for an arbitrarily small $\chi > 0$, the ratio between the number of points $m_U$ in the interval $U$ and the number of points $m_V$ in the interval $V$ satisfies, with probability $1 - o(1)$,

$$\frac{m_U}{m_V} \leq p(1+\chi)\frac{|U|}{|V|}. \tag{8}$$

For a fixed $\chi$, choose a constant $\epsilon > 0$ such that $\frac{1-\epsilon}{1+\epsilon} > p + \chi$. From Lemma 2 it follows that with probability $1 - o(1)$ the largest gap between adjacent sample points falling into $U$ is at least $(1-\epsilon)|U|\frac{\ln m_U}{m_U}$. Similarly, with probability $1 - o(1)$ the largest gap between adjacent sample points falling into $V$ is at most $(1+\epsilon)|V|\frac{\ln m_V}{m_V}$. From (8) it follows that, with probability $1 - o(1)$, the ratio of the gap sizes is at least

$$\frac{(1-\epsilon)|U|\frac{\ln m_U}{m_U}}{(1+\epsilon)|V|\frac{\ln m_V}{m_V}} > \frac{1-\epsilon}{1+\epsilon}\,\frac{1}{p+\chi}\,\frac{\ln m_U}{\ln m_V} = (1+\gamma)\frac{\ln m_U}{\ln m_V} \geq (1+\gamma)\frac{\ln\!\left((p+\chi)\frac{|U|}{|V|}m_V\right)}{\ln m_V} = (1+\gamma)\left(1 + O(1)/\ln m_V\right) \to (1+\gamma)$$

as $m \to \infty$, for a constant $\gamma > 0$ such that $1 + \gamma \leq \frac{1-\epsilon}{1+\epsilon}\,\frac{1}{p+\chi}$. Hence for sufficiently large $m$, with probability $1 - o(1)$, the largest gap in $U$ is strictly bigger than the largest gap in $V$.

Now we can choose intervals $V_1, V_2$ such that $[0,1] \setminus (V_1 \cup V_2)$ is an arbitrarily small neighbourhood containing $x^*$, and pick an even smaller neighbourhood $U$ containing $x^*$ such that for all $x \in U$ and all $y \in V_1 \cup V_2$, $\frac{f(x)}{f(y)} < p < 1$ for some $p \in (0,1)$. Then with probability $1 - o(1)$, the largest gap in $U$ is bigger than both the largest gap in $V_1$ and the largest gap in $V_2$. ⊓⊔

4 Learning Linear Cut Separators in High Dimensions

In this section we consider the problem of learning the minimum density homogeneous (i.e., passing through the origin) linear cut in distributions over $\mathbb{R}^d$. Namely, we assume that some unknown probability distribution generates a finite i.i.d. sample of points in $\mathbb{R}^d$, and we wish to process this sample to find the $(d-1)$-dimensional hyperplane, through the origin of $\mathbb{R}^d$, that has the lowest probability density with respect to the sample-generating distribution. In other words, we wish to find how to cut the space $\mathbb{R}^d$ through the origin in the "sparsest direction".

Formally, let $F_d$ be the family of all probability distributions over $\mathbb{R}^d$ that have a continuous density function. We wish to show that there exists a linear cut learning algorithm that is consistent for $F_d$. Note that, by Theorem 5, no algorithm achieves uniform convergence for $F_d$ (even for $d = 1$).

Define the soft-margin algorithm with parameter $\gamma : \mathbb{N} \to \mathbb{R}^+$ as follows. Given a sample $S$ of size $m$, it counts, for every hyperplane, the number of sample points lying within distance $\gamma := \gamma(m)$ of it, and outputs the hyperplane with the lowest such count. In case of ties, it breaks them arbitrarily. We denote this algorithm by $H_\gamma$.
Formally, for any weight vector $\mathbf{w} \in S^{d-1}$ (the unit sphere in $\mathbb{R}^d$) we consider the "$\gamma$-strip" $h(\mathbf{w}, \gamma) = \{x \in \mathbb{R}^d : |\mathbf{w}^T x| \leq \gamma\}$ and count the number of sample points lying in it. We output the weight vector $\mathbf{w}$ for which the number of sample points in $h(\mathbf{w}, \gamma)$ is the smallest; we break ties arbitrarily. To fully specify the algorithm, it remains to specify the function $\gamma(m)$. As it turns out, there is a choice of the function $\gamma(m)$ which makes the algorithm consistent.

Theorem 4. If $\gamma(m) = \omega(1/\sqrt{m})$ and $\gamma(m) \to 0$ as $m \to \infty$, then $H_\gamma$ is consistent for $F_d$.

Proof. The structure of the proof is similar to that of Theorem 1; however, we will need more technical tools. First, fix $f$. For any weight vector $\mathbf{w} \in S^{d-1}$ and any $\gamma > 0$, define $f_\gamma(\mathbf{w})$ as the $d$-dimensional integral

$$f_\gamma(\mathbf{w}) := \int_{h(\mathbf{w},\gamma)} f(x)\, dx$$

over the $\gamma$-strip along $\mathbf{w}$. Note that for any $\mathbf{w} \in S^{d-1}$, $\lim_{m \to \infty} f_{\gamma(m)}(\mathbf{w})/\gamma(m) = f(\mathbf{w})$ (assuming that $\gamma(m) \to 0$). In other words, the sequence of functions $\{f_{\gamma(m)}/\gamma(m)\}_{m=1}^{\infty}$, with $f_{\gamma(m)}/\gamma(m) : S^{d-1} \to \mathbb{R}^+_0$, converges pointwise to the function $f : S^{d-1} \to \mathbb{R}^+_0$. Note that $f_{\gamma(m)}/\gamma(m)$ is continuous for every $m$, and recall that $S^{d-1}$ is compact. Therefore the sequence $\{f_{\gamma(m)}/\gamma(m)\}_{m=1}^{\infty}$ converges uniformly to $f$. In other words, for every $\zeta > 0$ there exists $m_0$ such that for any $m \geq m_0$ and any $\mathbf{w} \in S^{d-1}$,

$$\left|\frac{f_{\gamma(m)}(\mathbf{w})}{\gamma(m)} - f(\mathbf{w})\right| < \zeta.$$

Fix $f$ and $\epsilon, \delta > 0$. Let $U = \{\mathbf{w} \in S^{d-1} : |\mathbf{w}^T \mathbf{w}^*| > 1 - \epsilon\}$ be the "$\epsilon$-double-neighbourhood" of the antipodal pair $\{\mathbf{w}^*, -\mathbf{w}^*\}$. The set $S^{d-1} \setminus U$ is compact and hence $\alpha := \min f(S^{d-1} \setminus U)$ exists. Since $\mathbf{w}^*$ and $-\mathbf{w}^*$ are the only minimizers of $f$, $\alpha > f(\mathbf{w}^*)$ and hence $\eta := \alpha - f(\mathbf{w}^*)$ is positive. The assumptions on $\gamma(m)$ imply that there exists $m_0$ such that for all $m \geq m_0$,

$$2\sqrt{\frac{d + \ln(1/\delta)}{m}} < \frac{\eta}{3}\,\gamma(m) \tag{9}$$

$$\left|\frac{f_{\gamma(m)}(\mathbf{w})}{\gamma(m)} - f(\mathbf{w})\right| < \eta/3 \quad \text{for all } \mathbf{w} \in S^{d-1} \tag{10}$$

Fix any $m \geq m_0$. For any $\mathbf{w} \in S^{d-1} \setminus U$, we have

$$\frac{f_{\gamma(m)}(\mathbf{w})}{\gamma(m)} > f(\mathbf{w}) - \eta/3 \qquad \text{by (10)}$$
$$\geq f(\mathbf{w}^*) + \eta - \eta/3 \qquad \text{by the choice of } \eta \text{ and } U$$
$$= f(\mathbf{w}^*) + 2\eta/3$$
$$> \frac{f_{\gamma(m)}(\mathbf{w}^*)}{\gamma(m)} - \eta/3 + 2\eta/3 \qquad \text{by (10)}$$
$$= \frac{f_{\gamma(m)}(\mathbf{w}^*)}{\gamma(m)} + \eta/3.$$

From the above chain of inequalities, after multiplying by $\gamma(m)$, we have

$$f_{\gamma(m)}(\mathbf{w}) > f_{\gamma(m)}(\mathbf{w}^*) + \eta\,\gamma(m)/3. \tag{11}$$

From the well known Vapnik-Chervonenkis bounds [2], with probability at least $1 - \delta$ over i.i.d. draws of a sample $S$ of size $m$, for every $\mathbf{w}$,

$$\left|\frac{|h(\mathbf{w}, \gamma) \cap S|}{m} - f_{\gamma(m)}(\mathbf{w})\right| \leq \sqrt{\frac{d + \ln(1/\delta)}{m}}, \tag{12}$$

where $|h(\mathbf{w}, \gamma) \cap S|$ denotes the number of sample points lying in the $\gamma$-strip $h(\mathbf{w}, \gamma)$. Fix any sample $S$ satisfying inequality (12). We have, for any $\mathbf{w} \in S^{d-1} \setminus U$,

$$\frac{|h(\mathbf{w}, \gamma) \cap S|}{m} \geq f_{\gamma(m)}(\mathbf{w}) - \sqrt{\frac{d + \ln(1/\delta)}{m}} \qquad \text{by (12)}$$
$$> f_{\gamma(m)}(\mathbf{w}^*) + \eta\,\gamma(m)/3 - \sqrt{\frac{d + \ln(1/\delta)}{m}} \qquad \text{by (11)}$$
$$\geq \frac{|h(\mathbf{w}^*, \gamma) \cap S|}{m} - \sqrt{\frac{d + \ln(1/\delta)}{m}} + \eta\,\gamma(m)/3 - \sqrt{\frac{d + \ln(1/\delta)}{m}} \qquad \text{by (12)}$$
$$> \frac{|h(\mathbf{w}^*, \gamma) \cap S|}{m} \qquad \text{by (9)}$$

Since $|h(\mathbf{w}, \gamma) \cap S| > |h(\mathbf{w}^*, \gamma) \cap S|$ for every $\mathbf{w} \in S^{d-1} \setminus U$, the algorithm cannot output a weight vector lying in $S^{d-1} \setminus U$. In other words, the algorithm's output, $H_\gamma(S)$, lies in $U$, i.e., $|H_\gamma(S)^T \mathbf{w}^*| > 1 - \epsilon$. We have proven that for any $\epsilon, \delta > 0$ there exists $m_0$ such that for all $m \geq m_0$, if a sample $S$ of size $m$ is drawn i.i.d. from $f$, then with probability at least $1 - \delta$, $|H_\gamma(S)^T \mathbf{w}^*| > 1 - \epsilon$. In other words, $H_\gamma$ is consistent for $f$. ⊓⊔
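A direct implementation of $H_\gamma$ must minimize over the whole sphere; the sketch below (ours, not the paper's algorithm) approximates this by scoring a finite set of random candidate directions, with $\gamma(m) = m^{-1/4}$ as one arbitrary schedule satisfying the hypotheses of Theorem 4. The two-blob demo distribution is likewise a hypothetical choice.

```python
import numpy as np

def soft_margin_algorithm(S, gamma, n_candidates=1_000, seed=0):
    """Approximate H_gamma: among random unit vectors w, return the one whose
    gamma-strip {x : |w^T x| <= gamma} contains the fewest sample points.
    (The paper minimizes over all w in S^{d-1}; random candidates are an
    illustrative discretization.)"""
    rng = np.random.default_rng(seed)
    S = np.asarray(S)                                   # shape (m, d)
    W = rng.standard_normal((n_candidates, S.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)       # uniform on the sphere
    counts = (np.abs(S @ W.T) <= gamma).sum(axis=0)     # points per gamma-strip
    return W[int(np.argmin(counts))]                    # ties broken arbitrarily

def gamma_of_m(m):
    # One admissible schedule: gamma(m) -> 0 and gamma(m) = omega(1/sqrt(m)).
    return m ** -0.25

rng = np.random.default_rng(2)
m = 10_000
# Two Gaussian blobs separated along the x-axis: the sparsest homogeneous
# cut is roughly the vertical hyperplane, i.e. w close to (+-1, 0).
X = np.concatenate([rng.normal([-3.0, 0.0], 1.0, (m // 2, 2)),
                    rng.normal([+3.0, 0.0], 1.0, (m // 2, 2))])
print(soft_margin_algorithm(X, gamma_of_m(m)))
```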
5 The Impossibility of Uniform Convergence

In this section we show a negative result that roughly says that one cannot hope for an algorithm that achieves $\epsilon$ accuracy and $1 - \delta$ confidence for sample sizes that depend only on these parameters and not on properties of the probability measure.

Theorem 5. No linear cut learning algorithm is uniformly convergent for $F_1$ with respect to any of the distance functions $D_E$, $D_f$ and $D_\mu$.

Proof. For a fixed $\delta > 0$ we show that for any $m \in \mathbb{N}$ there are distributions with density functions $f$ and $g$ such that no algorithm, given a random sample of size at most $m$ drawn from one of the two distributions chosen uniformly at random, can identify the distribution with probability of error less than $1/2$, with probability at least $1 - \delta$ over the random choice of the sample. Since for any $\delta$ and $m$ we find densities $f$ and $g$ such that, with probability more than $1 - \delta$, the output of the algorithm is bounded away by $1/4$ from either $1/4$ or $3/4$, no algorithm converges uniformly for the family $F_1$ w.r.t. any distance measure.

Consider two partly linear density functions $f$ and $g$ defined on $[0,1]$ such that, for some $n$, $f$ is linear on the intervals $[0, \tfrac{1}{4} - \tfrac{1}{2n}]$, $[\tfrac{1}{4} - \tfrac{1}{2n}, \tfrac{1}{4}]$, $[\tfrac{1}{4}, \tfrac{1}{4} + \tfrac{1}{2n}]$, and $[\tfrac{1}{4} + \tfrac{1}{2n}, 1]$, and satisfies

$$f(0) = f\!\left(\tfrac{1}{4} - \tfrac{1}{2n}\right) = f\!\left(\tfrac{1}{4} + \tfrac{1}{2n}\right) = f(1), \qquad f\!\left(\tfrac{1}{4}\right) = 0,$$

and $g$ is the reflection of $f$ w.r.t. the centre of the unit interval, i.e., $f(x) = g(1 - x)$. The functions $f$ and $g$ can be simply described as constant everywhere except for a thin V-shape of width $1/n$ around $1/4$ (resp. $3/4$) with its bottom at $0$. For any $x \notin [\tfrac{1}{4} - \tfrac{1}{2n}, \tfrac{1}{4} + \tfrac{1}{2n}] \cup [\tfrac{3}{4} - \tfrac{1}{2n}, \tfrac{3}{4} + \tfrac{1}{2n}]$, $f(x) = g(x)$.

Fig. 1. $f$ is uniform everywhere except a small neighbourhood of width $1/n$ around its minimizer $x^* = 1/4$, where it has a sharp 'V' shape; $g$ is the reflection of $f$ about $x = 1/2$, with minimizer $x^* = 3/4$.

Let us lower-bound the probability that a sample of size $m$ drawn from $f$ misses the set $U \cup V$, for $U := [\tfrac{1}{4} - \tfrac{1}{2n}, \tfrac{1}{4} + \tfrac{1}{2n}]$ and $V := [\tfrac{3}{4} - \tfrac{1}{2n}, \tfrac{3}{4} + \tfrac{1}{2n}]$. For any $x \in U$ and $y \notin U$, $f(x) \leq f(y)$; furthermore, $f$ is constant on the set $[0,1] \setminus U$, which carries at most the entire probability mass $1$. Therefore, writing $p_f(Z)$ for the probability that a point drawn from the distribution with density $f$ hits the set $Z$, we have

$$p_f(U) \leq p_f(V) \leq \frac{1}{n-1},$$

yielding $p_f(U \cup V) \leq \frac{2}{n-1}$. Hence an i.i.d. sample of size $m$ misses $U \cup V$ with probability at least $(1 - 2/(n-1))^m \geq (1-\eta)\, e^{-2m/n}$ for any constant $\eta > 0$ and $n$ sufficiently large. For a proper $\eta$ and $n$ sufficiently large we get $(1-\eta)\, e^{-2m/n} > 1 - \delta$. By the symmetry between $f$ and $g$, a random sample of size $m$ drawn from $g$ misses $U \cup V$ with the same probability.

We have shown that for any $\delta > 0$ and $m \in \mathbb{N}$, for $n$ sufficiently large, regardless of which of the two distributions the sample is drawn from, the sample intersects $U \cup V$ with probability less than $\delta$. Since the two density functions are equal on $[0,1] \setminus (U \cup V)$, the probability of error in discriminating between $f$ and $g$, conditioned on the sample not intersecting $U \cup V$, cannot be less than $1/2$. ⊓⊔
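The quantitative content of the proof is that the dip-width parameter $n$ must grow with $m$ for the sample to miss both dips. The sketch below (ours, for illustration) computes, from the bound $(1 - 2/(n-1))^m > 1 - \delta$, how large $n$ has to be, showing why no distribution-free sample size $m(\epsilon, \delta)$ can exist.

```python
def n_for(m, delta):
    """Smallest n (up to doubling) with (1 - 2/(n-1))**m > 1 - delta, i.e.
    making a size-m sample miss both V-shaped dips of the Theorem 5
    construction with probability more than 1 - delta."""
    n = 3
    while (1 - 2 / (n - 1)) ** m <= 1 - delta:
        n *= 2
    return n

for m in (100, 10_000, 1_000_000):
    print(m, n_for(m, delta=0.05))   # n grows roughly like 2*m/delta
```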
6 Conclusions and Open Questions

In this paper we have presented a novel unsupervised learning problem that is modest enough to allow learning algorithms with asymptotic learning guarantees, while being relevant to several central and challenging learning tasks. Our analysis can be viewed as providing justification for some common semi-supervised learning paradigms, such as the maximization of margins over the unlabeled sample or the search for empirically-sparse separating hyperplanes. As far as we know, our results provide the first performance guarantees for these paradigms.

From a more general perspective, the paper demonstrates a type of meaningful information about a data generating probability distribution that can be reliably learned from finite random samples of that distribution, in a fully non-parametric model, without postulating any prior assumptions about the structure of the data distribution. As such, the search for a low-density data separating hyperplane can be viewed as a basic tool for the initial analysis of unknown data, an analysis that can be carried out in situations where the learner has no prior knowledge about the data in question and can only access it via unsupervised random sampling.

Our analysis raises some intriguing open questions. First, note that while we prove the universal consistency of the hard-margin algorithm for data distributions over the real line, we do not have a similar result for higher dimensional data. Since searching for empirical maximal margins is a common heuristic, it is interesting to resolve the question of consistency of such algorithms. Another natural research direction that this work calls for is the extension of our results to more complex separators. In clustering, for example, it is common to search for clusters that are separated by sparse data regions. However, such between-cluster boundaries are often not linear. Can one provide any reliable algorithm for the detection of sparse boundaries from finite random samples when these boundaries belong to a richer family of functions?

Our research has focused on the information complexity of the task. However, to evaluate the practical usefulness of our proposed algorithms, one should also carry out a computational complexity analysis of the low-density separation task. We conjecture that the problem of finding the homogeneous hyperplane with largest margins, or with lowest density around it (with respect to a finite high dimensional set of points), is NP-hard when the Euclidean dimension is considered as part of the input, rather than as a fixed constant parameter. However, even if this conjecture is true, it will be interesting to find efficient approximation algorithms for these problems.

Acknowledgements. We would like to thank Noga Alon for a fruitful discussion.

References

1. A. Singh, C. Scott, and R. Nowak. Adaptive Hausdorff estimation of density level sets. 2007. http://www.eecs.umich.edu/~cscott/pubs.
2. Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
3. Shai Ben-David, Nadav Eiron, and Hans-Ulrich Simon. The computational complexity of densest region detection. J. Comput. Syst. Sci., 64(1):22-47, 2002.
4. Shai Ben-David and Michael Lindenbaum. Learning distributions by their density levels: A paradigm for learning without a teacher. J. Comput. Syst. Sci., 55(1):171-182, 1997.
5. Shai Ben-David, Tyler Lu, and Dávid Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, 2008.
6. Shai Ben-David and Ulrike von Luxburg. Relating clustering stability to properties of cluster boundaries. In COLT, 2008.
7. O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
8. Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer-Verlag, 2001.
9. Thorsten Joachims. Transductive inference for text classification using support vector machines. In ICML, pages 200-209, 1999.
10. Paul Lévy. Sur la division d'un segment par des points choisis au hasard. C. R. Acad. Sci. Paris, 208:147-149, 1939.
11. Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
12. Ohad Shamir and Naftali Tishby. Model selection and stability in k-means clustering. In COLT, 2008.
13. A. B. Tsybakov. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948-969, 1997.
