An Analysis of the Convergence of Graph Laplacians
Existing approaches to analyzing the asymptotics of graph Laplacians typically assume a well-behaved kernel function with smoothness assumptions. We remove the smoothness assumption and generalize the analysis of graph Laplacians to include previousl…
Authors: Daniel Ting, Ling Huang, Michael Jordan
Daniel Ting, Department of Statistics, University of California, Berkeley
Ling Huang, Intel Research
Michael Jordan, Department of EECS and Statistics, University of California, Berkeley

May 30, 2018

Abstract

Existing approaches to analyzing the asymptotics of graph Laplacians typically assume a well-behaved kernel function with smoothness assumptions. We remove the smoothness assumption and generalize the analysis of graph Laplacians to include previously unstudied graphs including kNN graphs. We also introduce a kernel-free framework to analyze graph constructions with shrinking neighborhoods in general and apply it to analyze locally linear embedding (LLE). We also describe how, for a given limiting Laplacian operator, desirable properties such as a convergent spectrum and sparseness can be achieved by choosing the appropriate graph construction.

1 Introduction

Graph Laplacians have become a core technology throughout machine learning. In particular, they have appeared in clustering Kannan et al. (2004); von Luxburg et al. (2008), dimensionality reduction Belkin & Niyogi (2003); Nadler et al. (2006), and semi-supervised learning Belkin & Niyogi (2004); Zhu et al. (2003). While graph Laplacians are but one member of a broad class of methods that use local neighborhood graphs to model data lying on a low-dimensional manifold embedded in a high-dimensional space, they are distinguished by their appealing mathematical properties, notably: (1) the graph Laplacian is the infinitesimal generator for a random walk on the graph, and (2) it is a discrete approximation to a weighted Laplace-Beltrami operator on a manifold, an operator which has numerous geometric properties and induces a smoothness functional.
These mathematical properties have served as a foundation for the development of a growing theoretical literature that has analyzed learning procedures based on the graph Laplacian. To review briefly, Bousquet et al. (2003) proved an early result for the convergence of the unnormalized graph Laplacian to a regularization functional that depends on the squared density $p^2$. Belkin & Niyogi (2005) demonstrated the pointwise convergence of the empirical unnormalized Laplacian to the Laplace-Beltrami operator on a compact manifold with uniform density. Lafon (2004) and Nadler et al. (2006) established a connection between graph Laplacians and the infinitesimal generator of a diffusion process. They further showed that one may use the degree operator to control the effect of the density. Hein et al. (2005) combined and generalized these results for weak and pointwise (strong) convergence under weaker assumptions as well as providing rates for the unnormalized, normalized, and random walk Laplacians. They also make explicit the connections to the weighted Laplace-Beltrami operator. Singer (2006) obtained improved convergence rates for a uniform density. Giné & Koltchinskii (2005) established a uniform convergence result and functional central limit theorem to extend the pointwise convergence results. von Luxburg et al. (2008) and Belkin & Niyogi (2006) presented spectral convergence results for the eigenvectors of graph Laplacians in the fixed and shrinking bandwidth cases respectively.

Although this burgeoning literature has provided many useful insights, several gaps remain between theory and practice.
Most notably, in constructing the neighborhood graphs underlying the graph Laplacian, several choices must be made, including the choice of algorithm for constructing the graph, with $k$-nearest-neighbor (kNN) and kernel functions providing the main alternatives, as well as the choice of parameters ($k$, kernel bandwidth, normalization weights). These choices can lead to the graph Laplacian generating fundamentally different random walks and approximating different weighted Laplace-Beltrami operators. The existing theory has focused on one specific choice in which graphs are generated with smooth kernels with shrinking bandwidths. But a variety of other choices are often made in practice, including kNN graphs, $r$-neighborhood graphs, and the "self-tuning" graphs of Zelnik-Manor & Perona (2004). Surprisingly, few of the existing convergence results apply to these choices (see Maier et al. (2008) for an exception).

This paper provides a general theoretical framework for analyzing graph Laplacians and operators that behave like Laplacians. Our point of view differs from that found in the existing literature; specifically, our point of departure is a stochastic process framework that utilizes the characterization of diffusion processes via drift and diffusion terms. This yields a general kernel-free framework for analyzing graph Laplacians with shrinking neighborhoods. We use it to extend the pointwise results of Hein et al. (2007) to cover non-smooth kernels and introduce location-dependent bandwidths. Applying these tools we are able to identify the asymptotic limit for a variety of graph constructions including kNN, $r$-neighborhood, and "self-tuning" graphs. We are also able to provide an analysis for Locally Linear Embedding (Roweis & Saul, 2000).
A practical motivation for our interest in graph Laplacians based on kNN graphs is that these can be significantly sparser than those constructed using kernels, even if they have the same limit. Our framework allows us to establish this limiting equivalence. On the other hand, we can also exhibit cases in which kNN graphs converge to a different limit than graphs constructed from kernels, and this explains some cases where kNN graphs perform poorly. Moreover, our framework allows us to generate new algorithms: in particular, by using location-dependent bandwidths we obtain a class of operators that have nice spectral convergence properties that parallel those of the normalized Laplacian in von Luxburg et al. (2008), but which converge to a different class of limits.

2 The Framework

Our work exploits the connections among diffusion processes, elliptic operators (in particular the weighted Laplace-Beltrami operator), and stochastic differential equations (SDEs). This builds upon the diffusion process viewpoint in Nadler et al. (2006). Critically, we make the connection to the drift and diffusion terms of a diffusion process. This allows us to present a kernel-free framework for analysis of graph Laplacians as well as giving a better intuitive understanding of the limit diffusion process.

We first give a brief overview of these connections and present our general framework for the asymptotic analysis of graph Laplacians as well as providing some relevant background material. We then introduce our assumptions and derive our main results for the limit operator for a wide range of graph construction methods. We use these to calculate asymptotic limits for specific graph constructions.

2.1 Relevant Differential Geometry

Assume $\mathcal{M}$ is an $m$-dimensional manifold embedded in $\mathbb{R}^b$.
To identify the asymptotic infinitesimal generator of a diffusion on this manifold, we will derive the drift and diffusion terms in normal coordinates at each point. We refer the reader to Boothby (1986) for an exact definition of normal coordinates. For our purposes it suffices to note that normal coordinates are coordinates in $\mathbb{R}^m$ that behave roughly as if the neighborhood was projected onto the tangent plane at $x$. The extrinsic coordinates are the coordinates $\mathbb{R}^b$ in which the manifold is embedded. Since the density, and hence integration, is defined with respect to the manifold, we must relate normal coordinates $s$ around a point $x$ to the extrinsic coordinates $y$. This relation may be given as follows:

$$y - x = H_x s + L_x(ss^T) + O(\|s\|^3), \tag{1}$$

where $H_x$ is a linear isomorphism between the normal coordinates in $\mathbb{R}^m$ and the $m$-dimensional tangent plane $T_x$ at $x$. $L_x$ is a linear operator describing the curvature of the manifold and takes $m \times m$ positive semidefinite matrices into the space orthogonal to the tangent plane, $T_x^{\perp}$. More advanced readers will note that this statement is Gauss' lemma and $H_x$ and $L_x$ are related to the first and second fundamental forms.

We are most interested in limits involving the weighted Laplace-Beltrami operator, a particular second-order differential operator.

2.2 Weighted Laplace-Beltrami operator

Definition 1 (Weighted Laplace-Beltrami operator). The weighted Laplace-Beltrami operator with respect to the density $q$ is the second-order differential operator defined by

$$\Delta_q := \Delta_{\mathcal{M}} - \frac{\nabla q^T}{q} \nabla,$$

where $\Delta_{\mathcal{M}} := \mathrm{div} \circ \nabla$ is the unweighted Laplace-Beltrami operator.

It is of particular interest since it induces a smoothing functional for $f \in C^2(\mathcal{M})$ with support contained in the interior of the manifold:

$$\langle f, \Delta_q f \rangle_{L^2(q)} = \|\nabla f\|^2_{L^2(q)}. \tag{2}$$

Note that existing literature on asymptotics of graph Laplacians often refers to the $s$th weighted Laplace-Beltrami operator as $\Delta_s$ where $s \in \mathbb{R}$. This is $\Delta_{p^s}$ in our notation. For more information on the weighted Laplace-Beltrami operator see Grigor'yan (2006).

2.3 Equivalence of Limiting Characterizations

We now establish the promised connections among elliptic operators, diffusions, SDEs, and graph Laplacians. We first show that elliptic operators define diffusion processes and SDEs and vice versa. An elliptic operator $G$ is a second-order differential operator of the form

$$G f(x) = \sum_{ij} a_{ij}(x) \frac{\partial^2 f(x)}{\partial x_i \partial x_j} + \sum_i b_i(x) \frac{\partial f(x)}{\partial x_i} + c(x) f(x),$$

where the $m \times m$ coefficient matrix $(a_{ij}(x))$ is positive semidefinite for all $x$. If we use normal coordinates for a manifold, we see that the weighted Laplace-Beltrami operator $\Delta_q$ is a special case of an elliptic operator with $(a_{ij}(x)) = I$, the identity matrix, $b(x) = \frac{\nabla q(x)}{q(x)}$, and $c(x) = 0$.

Diffusion processes are related via a result by Dynkin which states that given a diffusion process, the generator of the process is an elliptic operator. The (infinitesimal) generator $G$ of a diffusion process $X_t$ is defined as

$$G f(x) := \lim_{t \to 0} \frac{E_x f(X_t) - f(x)}{t}$$

when the limit exists and convergence is uniform over $x$. Here $E_x f(X_t) = E(f(X_t) \mid X_0 = x)$. A converse relation holds as well. The Hille-Yosida theorem characterizes when a linear operator, such as an elliptic operator, is the generator of a stochastic process. We refer the reader to Kallenberg (2002) for proofs.

A time-homogeneous stochastic differential equation (SDE) defines a diffusion process as a solution (when one exists) to the equation

$$dX_t = \mu(X_t)\, dt + \sigma(X_t)\, dW_t,$$

where $X_t$ is a diffusion process taking values in $\mathbb{R}^d$.
The terms $\mu(x)$ and $\sigma(x)\sigma(x)^T$ are the drift and diffusion terms of the process. By Dynkin's result, the generator $G$ of this process defines an elliptic operator and a simple calculation shows the operator is

$$G f(x) = \frac{1}{2} \sum_{ij} \left(\sigma(x)\sigma(x)^T\right)_{ij} \frac{\partial^2 f(x)}{\partial x_i \partial x_j} + \sum_i \mu_i(x) \frac{\partial f(x)}{\partial x_i}.$$

In such diffusion processes there is no absorbing state and the term $c(x)$ in the elliptic operator is $0$. We note that one may also consider more general diffusion processes where $c(x) \le 0$. When $c(x) < 0$ we have the generator of a diffusion process with killing, where $c(x)$ determines the killing rate of the diffusion at $x$.

To summarize, we see that an SDE or diffusion process defines an elliptic operator, and importantly, the coefficients are the drift and diffusion terms, and the reverse relationship holds: an elliptic operator defines a diffusion under some regularity conditions on the coefficients.

All that remains then is to connect diffusion processes in continuous space to graph Laplacians on a finite set of points. Diffusion approximation theorems provide this connection. We state one version of such a theorem.

Theorem 2 (Diffusion Approximation). Let $\mu(x)$ and $\sigma(x)\sigma(x)^T$ be drift and diffusion terms for a diffusion process defined on a compact set $S \subset \mathbb{R}^b$, and let $G$ be the corresponding infinitesimal generator. Let $\{Y^{(n)}_t\}_t$ be Markov chains with transition matrices $P_n$ on state spaces $\{x_i\}_{i=1}^n$ for all $n$, and let $c_n > 0$ define a sequence of scalings. Put

$$\hat{\mu}_n(x_i) = c_n E\left(Y^{(n)}_1 - x_i \mid Y^{(n)}_0 = x_i\right),$$
$$\hat{\sigma}_n(x_i)\hat{\sigma}_n(x_i)^T = c_n \mathrm{Var}\left(Y^{(n)}_1 \mid Y^{(n)}_0 = x_i\right).$$

Let $f \in C^2(S)$.
If for all $\epsilon > 0$

$$\hat{\mu}_n(x_i) \to \mu(x_i),$$
$$\hat{\sigma}_n(x_i)\hat{\sigma}_n(x_i)^T \to \sigma(x_i)\sigma(x_i)^T,$$
$$c_n \sup_{i \le n} P\left(\left\|Y^{(n)}_1 - x_i\right\| > \epsilon \,\middle|\, Y^{(n)}_0 = x_i\right) \to 0,$$

then the generators $A_n f = c_n (P_n - I) f \to G f$. Furthermore, for any bounded $f$ and $t_0 > 0$ and the continuous-time transition kernels $T_n(t) = \exp(t A_n)$ and $T$ the transition kernel for $G$, we have $T_n(t) f \to T(t) f$ uniformly in $t$ for $t < t_0$.

Proof. Throughout the proof, $A$ denotes the limiting generator $G$. We first examine the case when $f(x) = x$. By assumption,

$$A_n \pi_n x = c_n (P_n - I) x = c_n E\left(Y^{(n)}_1 - x_i \mid Y^{(n)}_0 = x_i\right) = \hat{\mu}_n(x) \to \mu(x) = A x.$$

Similarly if $f(x) = xx^T$, $\|A_n \pi_n f - A f\|_\infty \to 0$. If $f(x) = 1$, then $A_n \pi_n f = \pi_n A f = 0$. Thus, by linearity of $A_n$, $A_n \pi_n f \to A f$ for any quadratic polynomial $f$.

Taylor expand $f$ to obtain $f(x + h) = q_x(h) + \delta_x(h)$ where $q_x(h)$ is a quadratic polynomial in $h$. Since the second derivative is continuous and the support of $f$ is compact, $\sup_{x \in \mathcal{M}} \delta_x(h) = o(\|h\|^2)$ and $\sup_{x,h} \delta_x(h) < M$ for some constant $M$.

Let $\Delta_n = Y^{(n)}_1 - x_i$. We may bound $A_n$ acting on the remainder term $\delta_x(h)$ by

$$\begin{aligned}
\sup_x A_n \delta_x &= \sup_x c_n E\left(\delta_x(\Delta_n) \mid Y^{(n)}_0 = x\right) \\
&\le \sup_x c_n E\left(\delta_x(\Delta_n) \mathbf{1}(\|\Delta_n\| \le \epsilon) \mid Y^{(n)}_0 = x\right) + M \sup_x c_n P\left(\|\Delta_n\| > \epsilon \mid Y^{(n)}_0 = x\right) \\
&= o\left(c_n E\left(\|\Delta_n\|^2 \mid Y^{(n)}_0 = x\right)\right) + M \sup_x c_n P\left(\|\Delta_n\| > \epsilon \mid Y^{(n)}_0 = x\right) \\
&= o(1),
\end{aligned}$$

where the last equality holds by the assumptions on the uniform convergence of the diffusion term $\hat{\sigma}_n \hat{\sigma}_n^T$ and on the shrinking jump sizes. Thus, $A_n \pi_n f \to A f$ for any $f \in C^2(\mathcal{M})$.

The class of functions $C^2(\mathcal{M})$ is dense in $L^\infty(\mathcal{M})$ and forms a core for the generator $A$. Standard theorems give equivalence between strong convergence of infinitesimal generators on a core and uniform strong convergence of transition kernels on a Banach space (e.g. Theorem 1.6.1 in Ethier & Kurtz (1986)).
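
As a numerical illustration of Theorem 2 (not part of the original analysis), the following Python sketch checks the convergence $c_n (P_n - I) f \to G f$ for a toy one-dimensional nearest-neighbour chain on a lattice of spacing $h$. The chain itself, the drift $\mu(x) = -x$, the test function $f = \sin$, and the names `scaled_chain_generator` and `limit_generator` are assumptions made purely for illustration. With $c_n = 1/h^2$ and up-probability $1/2 + h\mu(x)/2$, the scaled mean of a step is $\mu(x)$ and the scaled variance tends to $1$, so the limit should be $G f = \tfrac{1}{2} f'' + \mu f'$.

```python
import numpy as np

def scaled_chain_generator(f, x, h, mu):
    """Return c_n (P_n - I) f evaluated at x for a nearest-neighbour chain.

    The chain (an illustrative assumption) moves from x to x + h with
    probability 1/2 + h*mu(x)/2 and to x - h otherwise; c_n = 1 / h**2.
    """
    c_n = 1.0 / h**2
    p_up = 0.5 + 0.5 * h * mu(x)
    p_down = 1.0 - p_up
    pf = p_up * f(x + h) + p_down * f(x - h)
    return c_n * (pf - f(x))

def limit_generator(x):
    """G f = (1/2) f'' + mu f' for f = sin and mu(x) = -x."""
    return -0.5 * np.sin(x) - x * np.cos(x)

mu = lambda x: -x
x = np.linspace(-1.0, 1.0, 21)

# Sup-norm error between the scaled chain generator and the limit,
# for a sequence of shrinking step sizes.
errors = [
    np.max(np.abs(scaled_chain_generator(np.sin, x, h, mu) - limit_generator(x)))
    for h in (0.1, 0.05, 0.025)
]
```

The entries of `errors` should shrink as the step size decreases, mirroring the role of the shrinking jump sizes in the theorem. A graph Laplacian built from random samples replaces the lattice by the sample points and the nearest-neighbour step by a kernel-weighted jump.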
We remark that though the results we have discussed thus far are stated in the context of the extrinsic coordinates $\mathbb{R}^b$, we describe appropriate extensions in terms of normal coordinates in the appendix.

2.4 Assumptions

We describe here the assumptions and notation for the rest of the paper. We will refer to the following assumptions as the standard assumptions. Unless stated explicitly otherwise, let $f$ be an arbitrary function in $C^2(\mathcal{M})$.

Manifold assumptions. Assume $\mathcal{M}$ is a smooth $m$-dimensional manifold isometrically embedded in $\mathbb{R}^b$ via the map $i : \mathcal{M} \to \mathbb{R}^b$. The essential conditions that we require on the manifold are

1. Smoothness: the map $i$ is a smooth embedding.
2. A single radius $h_0$ such that for all $x \in \mathrm{supp}(f)$, $\mathcal{M} \cap B(x, h_0)$ is a neighborhood of $x$ with normal coordinates, and
3. Bounded curvature of the manifold over $\mathrm{supp}(f)$, i.e. that the second fundamental form is bounded.

When the manifold is smooth and compact, these conditions are satisfied.

Assume points $\{x_i\}_{i=1}^{\infty}$ are sampled i.i.d. from a density $p \in C^2(\mathcal{M})$ with respect to the natural volume element of the manifold, and that $p$ is bounded away from $0$.

Notation. For brevity, we will always use $x, y \in \mathbb{R}^b$ to be points on $\mathcal{M}$ expressed in extrinsic coordinates and $s \in \mathbb{R}^m$ to be normal coordinates for $y$ in a neighborhood centered at $x$. Since they represent the same point, we will also use $y$ and $s$ interchangeably as function arguments, i.e. $f(y) = f(s)$. Whenever we take a gradient, it is with respect to normal coordinates.

Generalized kernel. Though we use a kernel-free framework, our main theorem utilizes a kernel, but one that generalizes previously studied kernels by 1) considering non-smooth base kernels $K_0$, 2) introducing location-dependent bandwidth functions $r_x(y)$, and 3) considering general weight functions $w_x(y)$.
Our main result also handles 4) random weight and bandwidth functions.

Given a bandwidth scaling parameter $h > 0$, define a new kernel by

$$K(x, y) = w_x(y)\, K_0\!\left(\frac{\|y - x\|}{h\, r_x(y)}\right). \tag{3}$$

Previously analyzed constructions for smooth kernels with compact support are described by this more general kernel with $r_x = 1$ and $w_x(y) = d(x)^{-\lambda} d(y)^{-\lambda}$ where $d(x)$ is the degree function and $\lambda \in \mathbb{R}$ is some constant. The directed kNN graph is obtained if $K_0(x, y) = \mathbf{1}(\|x - y\| \le 1)$, $r_x(y)$ = distance to the $k$th nearest neighbor of $x$, and $w_x(y) = 1$ for all $x, y$.

We note that the kernel $K$ is not necessarily symmetric; however, if $r_x(y) = r_y(x)$ and $w_x(y) = w_y(x)$ for all $x, y \in \mathcal{M}$ then the kernel is symmetric and the corresponding unnormalized Laplacian is positive semi-definite.

Kernel assumptions. We now introduce our assumptions on the choices $K_0, h, w_x, r_x$ that govern the graph construction. Assume that the base kernel $K_0 : \mathbb{R}_+ \to \mathbb{R}_+$ has bounded variation and compact support and that $h_n > 0$ form a sequence of bandwidth scalings. For (possibly random) location-dependent bandwidth and weight functions $r^{(n)}_x(\cdot) > 0$, $w^{(n)}_x(\cdot) \ge 0$, assume that they converge to $r_x(\cdot), w_x(\cdot)$ respectively and the convergence is uniform over $x \in \mathcal{M}$.
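
The generalized kernel of Eq. (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`generalized_kernel`, `random_walk_laplacian`, `indicator`), the dense $n \times n$ array representation of $r_x(y)$ and $w_x(y)$, and the synthetic Gaussian sample are all hypothetical. The example recovers the directed kNN graph by taking the indicator base kernel, $w_x(y) = 1$, and $h\, r_x(y)$ equal to the distance from $x$ to its $k$th nearest neighbor.

```python
import numpy as np

def generalized_kernel(X, h, K0, r, w):
    """K[i, j] = w[i, j] * K0(||x_j - x_i|| / (h * r[i, j])), following Eq. (3).

    r and w are (n, n) arrays holding r_{x_i}(x_j) and w_{x_i}(x_j).
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return w * K0(dists / (h * r))

def random_walk_laplacian(K):
    """L_rw = I - D^{-1} K, with D the diagonal matrix of row sums of K."""
    degrees = K.sum(axis=1, keepdims=True)
    return np.eye(K.shape[0]) - K / degrees

def indicator(u):
    """Non-smooth base kernel K0(u) = 1(u <= 1)."""
    return (u <= 1.0).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # synthetic sample (assumption)
n, k, h = X.shape[0], 10, 1.0

dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
# Distance from each x_i to its k-th nearest neighbour;
# column 0 of the sorted distances is x_i itself.
rho = np.sort(dists, axis=1)[:, k]

# Directed kNN graph: r_{x_i}(y) depends on x_i only, and w = 1.
r_knn = np.repeat(rho[:, None] / h, n, axis=1)
K_knn = generalized_kernel(X, h, indicator, r_knn, np.ones((n, n)))
L_rw = random_walk_laplacian(K_knn)
```

In this sketch each row of `K_knn` has $k + 1$ nonzero entries (the point itself and its $k$ nearest neighbors), and `K_knn` is generally not symmetric, in line with the remark above that $K$ need not be symmetric unless $r_x(y) = r_y(x)$ and $w_x(y) = w_y(x)$.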
Further assume they have Taylor-like expansions for all $x, y \in \mathcal{M}$ with $\|x - y\| < h_n$

$$\begin{aligned}
r^{(n)}_x(y) &= r_x(x) + \left(\dot{r}_x(x) + \alpha_x\, \mathrm{sign}(u_x^T s)\, u_x\right)^T s + \epsilon^{(n)}_r(x, s), \\
w^{(n)}_x(y) &= w_x(x) + \nabla w_x(x)^T s + \epsilon^{(n)}_w(x, s),
\end{aligned} \tag{4}$$

where the approximation error is uniformly bounded by $\sup_{x \in \mathcal{M},\, \|s\| < \cdots}$ […]

[…] a constant $Z_{K_0,m} > 0$ depending only on the base kernel $K_0$ and the dimension $m$ such that for $c_n = Z_{K_0,m}/h_n^2$,

$$-c_n L^{(n)}_{rw} f \to A f,$$

where $A$ is the infinitesimal generator of a diffusion process with the following drift and diffusion terms given in normal coordinates:

$$\mu_s(x) = r_x(x)^2 \left( \frac{\nabla p(x)}{p(x)} + \frac{\nabla w_x(x)}{w_x(x)} + (m + 2) \frac{\dot{r}_x(x)}{r_x(x)} \right), \qquad \sigma_s(x)\sigma_s(x)^T = r_x(x)^2 I,$$

where $I$ is the $m \times m$ identity matrix.

Proof. We apply the diffusion approximation theorem (Theorem 2) to obtain convergence of the random walk Laplacians. Since $h_n \downarrow 0$, the probability of a jump of size $> \epsilon$ equals $0$ eventually. Thus, we simply need to show uniform convergence of the drift and diffusion terms and identify their limits. We leave the detailed calculations in the appendix and present the main ideas in the proof here.

We first assume that $K_0$ is an indicator kernel. To generalize, we note that for kernels of bounded variation, we may write $K_0(x) = \int \mathbf{1}(|x| < z)\, d\eta_+(z) - \int \mathbf{1}(|x| < z)\, d\eta_-(z)$ for some finite positive measures $\eta_-, \eta_+$ with compact support. The result for general kernels then follows from Fubini's theorem.

We also initially assume that we are given the true density $p$. After identifying the desired limits given the true density, we show that the empirical version converges uniformly to the correct quantities.

The key calculation is lemma 7 in the appendix, which establishes that integrating against an indicator kernel is like integrating over a sphere re-centered on $h_n^2 \dot{r}_x(x)$.
Given this calculation and by Taylor expanding the non-kernel terms, one obtains the infinitesimal first and second moments and the degree operator.

$$\begin{aligned}
M^{(n)}_1(x) &= \frac{1}{h_n^m} \int s\, K_n(x, y)\, p(y)\, ds \\
&= \frac{1}{h_n^m} \int s\, w^{(n)}_x(s)\, K_0\!\left(\frac{\|y - x\|}{h_n r^{(n)}_x(s)}\right) p(s)\, ds \\
&= \frac{1}{h_n^m} \int s \left(w_x(x) + \nabla w_x(x)^T s + O(h_n^2)\right) \left(p(x) + \nabla p(x)^T s + O(h_n^2)\right) K_0\!\left(\frac{\|y - x\|}{h_n r^{(n)}_x(s)}\right) ds \\
&= C_{K_0,m}\, h_n^2\, r_x(x)^{m+2} \left( w_x(x) \frac{\nabla p(x)}{m + 2} + p(x) \frac{\nabla w_x(x)}{m + 2} + w_x(x)\, p(x)\, \dot{r}_x(x) + o(1) \right),
\end{aligned}$$

$$\begin{aligned}
M^{(n)}_2(x) &= \frac{1}{h_n^m} \int s s^T K_n(x, y)\, p(y)\, ds \\
&= \frac{1}{h_n^m} \int s s^T w^{(n)}_x(s)\, K_0\!\left(\frac{\|y - x\|}{h_n r^{(n)}_x(s)}\right) p(s)\, ds \\
&= \frac{1}{h_n^m} \int s s^T \left(w_x(x) + O(h_n)\right) \left(p(x) + O(h_n)\right) K_0\!\left(\frac{\|y - x\|}{h_n r^{(n)}_x(s)}\right) ds \\
&= \frac{C_{K_0,m}}{m + 2}\, h_n^2\, r_x(x)^{m+2} \left( w_x(x)\, p(x)\, I + O(h_n) \right),
\end{aligned}$$

$$\begin{aligned}
d_n(x) &= \frac{1}{h_n^m} \int K_n(x, y)\, p(y)\, ds && (6) \\
&= \frac{1}{h_n^m} \int w^{(n)}_x(s)\, K_0\!\left(\frac{\|y - x\|}{h_n r^{(n)}_x(s)}\right) p(s)\, ds && (7) \\
&= \frac{1}{h_n^m} \int \left(w_x(x) + O(h_n)\right) \left(p(x) + O(h_n)\right) K_0\!\left(\frac{\|y - x\|}{h_n r^{(n)}_x(s)}\right) ds && (8) \\
&= C'_{K_0,m}\, r_x(x)^m \left( w_x(x)\, p(x) + O(h_n) \right), && (9)
\end{aligned}$$

where $C_{K_0,m} = \int u^{m+2}\, d\eta$, $C'_{K_0,m} = \int u^m\, d\eta$ and $\eta$ is the signed measure $\eta = \eta_+ - \eta_-$. Let $Z_{K_0,m} = (m + 2) \frac{C'_{K_0,m}}{C_{K_0,m}}$ and $c_n = Z_{K_0,m}/h_n^2$. Since $K_n/d_n$ define Markov transition kernels, taking the limits $\mu_s(x) = \lim_{n \to \infty} c_n M^{(n)}_1(x)/d_n(x)$ and $\sigma_s(x)\sigma_s(x)^T = \lim_{n \to \infty} c_n M^{(n)}_2(x)/d_n(x)$ and applying the diffusion approximation theorem gives the stated result.

To more formally apply the diffusion approximation theorem we may calculate the drift and diffusion in extrinsic coordinates.
In extrinsic coordinates, we have

$$\mu(x) = r_x(x)^2 H_x \left( \frac{\nabla p(x)}{p(x)} + \frac{\nabla w_x(x)}{w_x(x)} + (m + 2) \frac{\dot{r}_x(x)}{r_x(x)} \right) + r_x(x)^2 L_x(I), \qquad \sigma(x)\sigma(x)^T = r_x(x)^2\, \Pi_{T_x},$$

where $\Pi_{T_x}$ is the projection onto the tangent plane at $x$, and $H_x$ and $L_x$ are the linear mappings between normal coordinates and extrinsic coordinates defined in Eqn (1).

We now consider the convergence of the empirical quantities. For non-random $r^{(n)}_x = r_x$, $w^{(n)}_x = w_x$, the uniform and almost sure convergence of the empirical quantities to the true expectation follows from an application of Bernstein's inequality. In particular, the value of $F_n(x, S) = S_i K\!\left(\frac{\|Y - x\|}{h_n r_x(Y)}\right)$ is bounded by $K_{\max} h_n$, where $S$ is $Y$ in normal coordinates and $K_{\max}$ depends on the kernel and the maximum curvature of the manifold. Furthermore, the second moment calculation for $M^{(n)}_2$ gives that the variance $\mathrm{Var}(F_n(x, S))$ is bounded by $c\, h_n^{m+2}$ for some constant $c$ that depends on $K$ and the max of $p$, and does not depend on $x$. By Bernstein's inequality and a union bound, we have

$$\begin{aligned}
\Pr\left( \sup_{i \le n} \left| E_n \frac{1}{h_n^{m+2}} F_n(x_i, Y) - \frac{1}{h_n^2} M^{(n)}_1 \right| > \epsilon \right)
&= \Pr\left( \sup_{i \le n} \left| E_n F_n(x_i, Y) - E F_n(x_i, Y) \right| > \epsilon h_n^{m+2} \right) \\
&< 2n \exp\left( - \frac{\epsilon^2}{2c/(n h_n^{m+2}) + 2 K_{\max} \epsilon/(3 n h_n^{m+1})} \right).
\end{aligned} \tag{10}$$

The uniform convergence a.s. of the first moment follows from Borel-Cantelli. Similar inequalities are attained for the empirical second moment and degree terms.

Now assume $r^{(n)}_x, w^{(n)}_x$ are random and define $F_n$ as before. To handle the random weight and bandwidth function case, we first choose deterministic weight and bandwidth functions to maximize the first moment under a constraint that is satisfied eventually a.s.
Define

$$\begin{aligned}
\bar{w}^{(n)}_x(y) &= w_x(y) + \kappa h_n^2\, \mathrm{sign}(s_i), \\
\bar{r}^{(n)}_x(y) &= r_x(x) + \left(\dot{r}_x(x) + \alpha_x\, \mathrm{sign}(u_x^T s)\, u_x\right)^T s - \kappa h_n^2\, \mathrm{sign}(s_i), \\
F_{\kappa,n}(x, y) &= s_i\, \bar{w}^{(n)}_x(y)\, K_0\!\left(\frac{\|y - x\|}{h_n \bar{r}^{(n)}_x(y)}\right)
\end{aligned}$$

for some constant $\kappa$ such that $\bar{r}^{(n)}_x < r^{(n)}_x$ and $\bar{w}^{(n)}_x > w^{(n)}_x$ eventually, where the bars mark the deterministic versions. This is possible since the perturbation terms $\epsilon^{(n)}_r(x, s), \epsilon^{(n)}_w(x, s) = O(h_n^2)$. Thus, we have $F_{\kappa,n}(x, y) > F_n(x, y)$ for all $x, y \in \mathcal{M}$ eventually with probability 1. Since $F_{\kappa,n}(x, Y)$ uses deterministic weight and bandwidth functions, we obtain i.i.d. random variables and may apply the Bernstein bound on $F_{\kappa,n}(x, y)$ to obtain an upper bound on the empirical quantities, namely $E_n F_{\kappa,n}(x, Y) > E_n F_n(x, Y)$ for all $x \in \mathcal{M}$ eventually with probability 1. We may similarly obtain a lower bound. By lemma 10, the difference between the expectation of the upper bound and that of the unperturbed version $F_{0,n}$ is $E F_{\kappa,n}(x, Y) - E F_{0,n}(x, Y) = o(\kappa h_n^{m+2})$. Applying the squeeze theorem gives a.s. uniform convergence of the empirical first moment $M^{(n)}_1/h_n^2$. The degree and second moment terms are handled similarly.

Since $p, w_x, r_x$ are all assumed to be bounded away from $0$, the scaled degree operators $d_n$ are eventually bounded away from $0$ with probability 1, and the continuous mapping theorem applied to $M^{(n)}_i/(h_n^2 d_n)$ gives a.s. uniform convergence of the drift and diffusion.

2.6 Unnormalized and Normalized Laplacians

While our results are for the infinitesimal generator of a diffusion process, that is, for the limit of the random walk Laplacian $L_{rw} = I - D^{-1} W$, it is easy to generalize them to the unnormalized Laplacian $L_u = D - W = D L_{rw}$ and the symmetrically normalized Laplacian $L_{norm} = I - D^{-1/2} W D^{-1/2} = D^{1/2} L_{rw} D^{-1/2}$.

Corollary 4.
Take the assumptions in Theorem 3, and let $A$ be the limiting operator of the random walk Laplacian. The degree terms $d_n(\cdot)$ converge uniformly a.s. to a function $d(\cdot)$, and

$$-c'_n L^{(n)}_u f \to d \cdot A f \quad \text{a.s.}$$

where $c'_n = c_n/h_n^m$. Furthermore, under the additional assumptions $n h_n^{m+4}/\log n \to \infty$, $\sup_{x,y} |w^{(n)}_x - w_x| = o(h_n^2)$, $\sup_{x,y} |r^{(n)}_x - r_x| = o(h_n^2)$, and $d, w_x, r_x \in C^2(\mathcal{M})$, we have

$$-c_n L^{(n)}_{norm} f \to d^{1/2} \cdot A(d^{-1/2} f) \quad \text{a.s.}$$

Proof. For any two functions $\phi_1, \phi_2 : \mathcal{M} \to \mathbb{R}$, define $g_u(\phi_1, \phi_2) = (\phi_1(\cdot), \phi_1(\cdot)\phi_2(\cdot))$. We note that $g_u$ is a continuous mapping in the $L^\infty$ topology and

$$(d_n, c'_n L^{(n)}_u f) = g_u(d_n, c_n L^{(n)}_{rw} f).$$

By the continuous mapping theorem, if $d_n \to d$ a.s. and $c_n L^{(n)}_{rw} f \to L f$ a.s. in the $L^\infty$ topology, then $c'_n L^{(n)}_u f \to d \cdot L f$. Thus, convergence of the random walk Laplacians implies convergence of the unnormalized Laplacian under the very weak condition of convergence of the degree operator to a bounded function.

Convergence of the normalized Laplacian is slightly trickier. We may write the normalized Laplacian as

$$\begin{aligned}
L^{(n)}_{norm} f &= d_n^{1/2} L^{(n)}_{rw}(d_n^{-1/2} f) && (11) \\
&= d_n^{1/2} L^{(n)}_{rw}(d^{-1/2} f) + d_n^{1/2} L^{(n)}_{rw}\left((d_n^{-1/2} - d^{-1/2}) f\right). && (12)
\end{aligned}$$

Using the continuous mapping theorem, we see that convergence of the normalized Laplacian, $c_n L^{(n)}_{norm} f \to d^{1/2} L_{rw}(d^{-1/2} f)$, is equivalent to showing $c_n L^{(n)}_{rw}((d_n^{-1/2} - d^{-1/2}) f) \to 0$. A Taylor expansion of the inverse square root gives that showing $c_n L^{(n)}_{rw}(d_n - d) \to 0$ is sufficient to prove convergence.

We now verify conditions which will ensure that the degree operators will converge at the appropriate rate. We further decompose the empirical degree operator into the bias $E d_n - d$ and empirical error $d_n - E d_n$.
Simply carrying out the Taylor expansions to higher order terms in the calculation of the degree function $d_n$ in Eq. 6, and using the refined calculation of the zeroth moment in lemma 8 in the appendix, the bias of the degree operator is $d_n - d = h_n^2 b + o(h_n^2)$ for some uniformly bounded, continuous function $b$. Thus we have

$$c_n L^{(n)}_{rw}(d_n - d) = c_n h_n^2 \|(I - P_n) b\|_\infty + o(1) = o(1) \tag{13}$$

since $c_n h_n^2$ is constant and $\|(I - P_n)\phi\|_\infty \to 0$ for any continuous function $\phi$.

We also need to check that the empirical error $\|d_n - E d_n\|_\infty = O(h_n^2)$ a.s. If $n h_n^{m+4}/\log n \to \infty$ then using the Bernstein bound in equation 10 with $\epsilon$ replaced by $h_n^2$ and applying Borel-Cantelli gives the desired result.

2.7 Limit as weighted Laplace-Beltrami operator

Under some regularity conditions, the limit given in the main theorem (Theorem 3) yields a weighted Laplace-Beltrami operator. For convenience, define $\gamma(x) = r_x(x)$, $\omega(x) = w_x(x)$.

Corollary 5. Assume the conditions of Theorem 3 and let $q = p^2 \omega \gamma^{m+2}$. If $r_x(y) = r_y(x)$, $w_x(y) = w_y(x)$ for all $x, y \in \mathcal{M}$ and $r_{(\cdot)}(\cdot), w_{(\cdot)}(\cdot)$ are twice differentiable in a neighborhood of $(x, x)$ for all $x$, then for $c'_n = Z_{K_0,m}/h_n^{m+2}$

$$-c'_n L^{(n)}_u \to \frac{q}{p} \Delta_q. \tag{14}$$

Proof. Note that $\nabla|_{y=x} \gamma(y) = 2 \nabla|_{y=x} r_x(y)$. The result follows from application of Theorem 3, Corollary 4, and the definition of the weighted Laplace-Beltrami operator.

3 Application to Specific Graph Constructions

To illustrate Theorem 3, we apply it to calculate the asymptotic limits of graph Laplacians for several widely used graph construction methods. We also apply the general diffusion theory framework to analyze LLE.

3.1 r-Neighborhood and Kernel Graphs

In the case of the $r$-neighborhood graph, the Laplacian is constructed using a kernel with fixed bandwidth and normalization.
The base kernel is simply the indicator function $K_0(x) = I(|x| < r)$. The radius $r_x(y)$ is constant so $\dot{r}_x(x) = 0$. The drift is given by $\mu_s(x) = \nabla p(x)/p(x)$ and the diffusion term is $\sigma_s(x)\sigma_s(x)^T = I$. The limit operator is thus

$$\frac{1}{2} \Delta_{\mathcal{M}} + \frac{\nabla p(x)^T}{p(x)} \nabla = \frac{1}{2} \Delta_2$$

as expected. This analysis also holds for arbitrary kernels of bounded variation. One may also introduce the usual weight function $w^{(n)}_x(y) = d_n(x)^{-\alpha} d_n(y)^{-\alpha}$ to obtain limits of the form $\frac{1}{2} \Delta_{p^{2 - 2\alpha}}$. These limits match those obtained by Hein et al. (2007) and Lafon (2004) for smooth kernels.

3.2 Directed k-Nearest Neighbor Graph

For kNN graphs, the base kernel is still the indicator kernel, and the weight function is the constant 1. However, the bandwidth function $r^{(n)}_x(y)$ is random and depends on $x$. Since the graph is directed, it does not depend on $y$, so $\dot{r}_x = 0$. By the analysis in section 3.4, $r_x(x) = c\, p^{-1/m}(x)$ for some constant $c$. Consequently the limit operator is proportional to

$$\frac{1}{p^{2/m}(x)} \left( \Delta_{\mathcal{M}} + 2 \frac{\nabla p^T}{p} \nabla \right) = \frac{1}{p^{2/m}} \Delta_{p^2}.$$

Note that this is generally not a self-adjoint operator in $L^2(p)$. The symmetrization of the graph has a non-trivial effect in making the graph Laplacian self-adjoint.

3.3 Undirected k-Nearest Neighbor Graph

We consider the OR-construction where the nodes $v_i$ and $v_j$ are linked if $v_i$ is a $k$-nearest neighbor of $v_j$ or vice-versa. In this case $h_n r^{(n)}_x(y) = \max\{\rho_n(x), \rho_n(y)\}$ where $\rho_n(x)$ is the distance to the $k_n$th nearest neighbor of $x$. The limit bandwidth function is non-differentiable, $r_x(y) = \max\{p^{-1/m}(x), p^{-1/m}(y)\}$, but a Taylor-like expansion exists with $\dot{r}_x(x) = \frac{1}{2m} \frac{\nabla p(x)^T}{p(x)}$. The limit operator is

$$\frac{1}{p^{2/m}} \Delta_{p^{1 - 2/m}},$$

which is self-adjoint in $L^2(p)$.
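
The directed and OR-symmetrized kNN constructions of Sections 3.2 and 3.3 can be sketched numerically as follows. This is an illustrative sketch only: the helper name `knn_radii`, the synthetic uniform sample, and the choice to keep self-loops on the diagonal are assumptions, not part of the original text. It checks that linking $i$ and $j$ when either is a $k$-nearest neighbor of the other is the same as thresholding the distance at $\max\{\rho_n(x_i), \rho_n(x_j)\}$, and that the resulting unnormalized Laplacian is symmetric.

```python
import numpy as np

def knn_radii(dists, k):
    """Distance rho(x_i) from each point to its k-th nearest neighbour."""
    return np.sort(dists, axis=1)[:, k]  # column 0 is the point itself

rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 2))   # synthetic sample (assumption)
k = 8

dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
rho = knn_radii(dists, k)

# Directed kNN graph: edge i -> j when x_j lies within rho(x_i) of x_i.
# The diagonal (self-loops) is kept; it cancels in D - W below.
A_directed = (dists <= rho[:, None]).astype(float)

# OR-construction: link i and j if either is a k-nearest neighbour of the other.
W_or = np.maximum(A_directed, A_directed.T)

# Equivalent bandwidth form: threshold the distance at max{rho(x_i), rho(x_j)}.
W_max = (dists <= np.maximum(rho[:, None], rho[None, :])).astype(float)

# Unnormalized Laplacian of the symmetrized graph and its smallest eigenvalue.
L_u = np.diag(W_or.sum(axis=1)) - W_or
smallest_eigenvalue = np.linalg.eigvalsh(L_u).min()
```

The directed adjacency `A_directed` is typically asymmetric, while `W_or` is symmetric by construction, so `L_u` is positive semi-definite up to rounding error.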
Surprisingly, if $m = 1$ then the kNN graph construction induces a drift away from high density regions.

3.4 Conditions for kNN convergence

To complete the analysis, we must check the conditions for kNN graph constructions to satisfy the assumptions of the main theorem. This is a straightforward application of existing uniform consistency results for kNN density estimation. Let $h_n = (k_n / n)^{1/m}$. The condition we must verify is

\[ \sup_{y \in \mathcal{M}} \left\| r_x^{(n)} - r_x \right\|_\infty = O(h_n^2) \quad \text{a.s.} \]

We check this for the directed kNN graph, but analyses for other kNN graphs are similar. The kNN density estimate of Loftsgaarden & Quesenberry (1965) is

\[ \hat{p}_n(x) = \frac{V_m}{n \left( h_n r_x^{(n)}(x) \right)^m} \tag{15} \]

where $h_n r_x^{(n)}(x)$ is the distance to the $k$-th nearest neighbor of $x$ given $n$ data points. Taylor expanding equation 15 shows that if $\| \hat{p}_n - p \|_\infty = O(h_n^2)$ a.s., then the requirement on the location dependent bandwidth for the main theorem is satisfied. The proof of Devroye & Wagner (1977) for the uniform consistency of kNN density estimation may be easily modified to show this. Take $\epsilon = (k_n / n)^2$ in their proof. One then sees that $h_n = k_n / n \to 0$ and

\[ \frac{n h_n^{m+2}}{\log n} = \frac{k_n^{2 + 2/m}}{n^{1 + 2/m} \log n} \to \infty \]

are sufficient to achieve the desired bound on the error.

3.5 "Self-Tuning" Graphs

The form of the kernel used in self-tuning graphs is

\[ K_n(x, y) = \exp\left( - \frac{\| x - y \|^2}{\sigma_n(x) \sigma_n(y)} \right) , \]

where $\sigma_n(x) = \rho_n(x)$, the distance between $x$ and its $k$-th nearest neighbor. The limit bandwidth function is $r_x(y) = \sqrt{ p^{-1/m}(x) \, p^{-1/m}(y) }$. Since this is twice differentiable, Corollary 5 gives the asymptotic limit, which is the same as for undirected kNN graphs, $p^{-2/m} \Delta_{p^{1 - 2/m}}$.

3.6 Locally Linear Embedding

Locally linear embedding (LLE), introduced by Roweis & Saul (2000), has been noted to behave like (the square of) the Laplace-Beltrami operator Belkin & Niyogi (2003).
Using our kernel-free framework we will show how LLE differs from weighted Laplace-Beltrami operators and graph Laplacians in several ways. 1) LLE has, in general, no well-defined asymptotic limit without additional conditions on the weights. 2) It can only behave like an unweighted Laplace-Beltrami operator. 3) It is affected by the curvature of the manifold, and the curvature can cause LLE to not behave like any elliptic operator (including the Laplace-Beltrami operator).

The key observation is that LLE only controls for the drift term in the extrinsic coordinates. Thus, the diffusion term has freedom to vary. However, if the manifold has curvature, the drift in extrinsic coordinates constrains the diffusion term in normal coordinates.

The LLE matrix is defined as $(I - W)^T (I - W)$, where $W$ is a weight matrix which minimizes reconstruction error, $W = \arg\min_{W'} \| (I - W') y \|^2$, under the constraints $W' \mathbf{1} = \mathbf{1}$ and $W'_{ij} \neq 0$ only if $j$ is one of the $k$ nearest neighbors of $i$. Typically $k > m$ and the reconstruction error is 0. We will analyze the matrix $M = I - W$.

Suppose LLE produces a sequence of matrices $M_n = I - W_n$. The row sums of $M_n$ are 0. Thus, we may decompose $M_n = A_n^+ - A_n^-$, where $A_n^+, A_n^-$ are generators for finite state Markov processes obtained from the positive and negative weights respectively. Assume that there is some scaling $c_n$ such that $c_n A_n^+$, $c_n A_n^-$ converge to generators of diffusion processes with drifts $\mu_+, \mu_-$ and diffusion terms $\sigma_+ \sigma_+^T$, $\sigma_- \sigma_-^T$. Set $\mu = \mu_+ - \mu_-$ and $\sigma \sigma^T = \sigma_+ \sigma_+^T - \sigma_- \sigma_-^T$.

No well-defined limit. We first show there is generally no well-defined asymptotic limit when one simply minimizes reconstruction error. Suppose $\mathrm{rank}(L_x) < m(m+1)/2$ at $x$. This will necessarily be true if the extrinsic dimension $b < m(m+1)/2 + m$. For simplicity assume $\mathrm{rank}(L_x) = 0$.
Minimizing the LLE reconstruction error does not constrain the diffusion term, and $\sigma(x) \sigma(x)^T$ may be chosen arbitrarily. Choose asymptotic diffusion $\sigma \sigma^T$ and drift $\mu$ terms that are Lipschitz so that a corresponding diffusion process necessarily exists. A diffusion with terms $2 \sigma \sigma^T$ and $\mu$ will also exist in that case. One may easily construct graphs for the positive and negative weights with these asymptotic diffusion and drift terms by solving highly underdetermined quadratic programs. Furthermore, in the interior of the manifold, these graphs may be constructed so that the finite sample drift terms are exactly equal by adding an additional constraint. Thus, $A_n^+ \to 2 G_0 + \mu^T \nabla$ and $A_n^- \to G_0 + \mu^T \nabla$, where $G_0$ is the generator for a diffusion process with zero drift and diffusion term $\sigma_-(x) \sigma_-(x)^T$. We have $c_n M_n = A_n^+ - A_n^- \to G_0$. Thus, we can construct a sequence of LLE matrices that have 0 reconstruction error but have an arbitrary limit. It is trivial to see how to modify the construction when $0 < \mathrm{rank}(L_x) < m(m+1)/2$.

No drift. Since $\mu_s(x) = 0$, if the LLE matrix does behave like a Laplace-Beltrami operator, it must behave like an unweighted one, and the density has no effect on the drift.

Curvature and limit. We now show that the curvature of the manifold affects LLE and that the LLE matrix may not behave like any elliptic operator. If the manifold has sufficient curvature, namely if the extrinsic coordinates have dimension $b \geq m + m(m+1)/2$ and $\mathrm{rank}(L_x) = m(m+1)/2$, then the diffusion term in the normal coordinates is fully constrained by the drift term in the extrinsic coordinates. Recall from equation 1 that the extrinsic coordinates as a function of the normal coordinates are $y = x + H_x s + L_x(s s^T) + O(\| s \|^3)$.
By linearity of $H_x$ and $L_x$, the asymptotic drift in the extrinsic coordinates is $\mu(x) = H_x \mu_s(x) + L_x(\sigma_s(x) \sigma_s(x)^T)$. Since reconstruction error in the extrinsic coordinates is 0, we have in normal coordinates

\[ \mu_s(x) = 0 \quad \text{and} \quad L_x(\sigma_s(x) \sigma_s(x)^T) = 0 . \]

In other words, the asymptotic drift and diffusion terms of $A_n^+$ and $A_n^-$ must be the same, and $c_n M_n \to G_0 - G_0 = 0$. This implies that the scaling $c_n$ where LLE can be expected to behave like an elliptic operator gives the trivial limit 0. If another scaling yields a non-trivial limit, it may include higher-order differential terms. It is easy to see that when $L_x$ is not full rank, the curvature affects LLE by partially constraining the diffusion term.

Regularization and LLE. We note that while the LLE framework of minimizing reconstruction error can yield ill-behaved solutions, practical implementations add a regularization term when constructing the weights. This causes the reconstruction error to be non-zero in general and gives unique solutions for the weights which favor equal weights (and asymptotic behavior like kNN graphs).

[Figure 1 panels: (A) Gaussian Manifold; (B) Kernel Laplacian embedding; (C) Raw kNN Laplacian Embedding; (D) rescaled kNN Laplacian Embedding]

Figure 1: (A) shows a 2D manifold where the x and y coordinates are drawn from a truncated standard normal distribution. (B-D) show embeddings using different graph constructions. (B) uses a normalized Gaussian kernel $\frac{K(x, y)}{d(x)^{1/2} d(y)^{1/2}}$, (C) uses a kNN graph, and (D) uses a kNN graph with edge weights $\sqrt{ \hat{p}(x) \hat{p}(y) }$. The bandwidth for (B) was chosen to be the median standard deviation from taking 1 step in the kNN graph.
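The construction behind panel (D) of Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code. It assumes the textbook form of the kNN density estimate, $\hat{p}(x) = k / (n V_m \rho_k(x)^m)$ with $\rho_k(x)$ the distance to the $k$-th nearest neighbor (the normalization may differ from Eq. 15 by constants), the OR-symmetrized kNN graph of Section 3.3, and a random-walk normalization $I - D^{-1} W$. The function names, the brute-force distance computation, and the choice of normalization are all assumptions made for the example.

```python
import math
import numpy as np


def unit_ball_volume(m):
    """Volume V_m of the unit ball in R^m."""
    return math.pi ** (m / 2) / math.gamma(m / 2 + 1)


def knn_distances(X, k):
    """Return (rho, D): distance to the k-th nearest neighbor (self excluded)
    and the full pairwise distance matrix. Assumes distinct points."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0).
    rho = np.sort(D, axis=1)[:, k]
    return rho, D


def knn_density(X, k, m=None):
    """Textbook kNN density estimate k / (n * V_m * rho^m).
    m is the intrinsic dimension; defaults to the ambient dimension."""
    n, d = X.shape
    m = d if m is None else m
    rho, _ = knn_distances(X, k)
    return k / (n * unit_ball_volume(m) * rho ** m)


def reweighted_knn_laplacian(X, k, m=None):
    """Random-walk Laplacian of an OR-symmetrized kNN graph whose edges
    are weighted by sqrt(p_hat(x) * p_hat(y))."""
    rho, D = knn_distances(X, k)
    # OR-construction: i ~ j if either point is within the other's kNN radius.
    A = D <= np.maximum(rho[:, None], rho[None, :])
    np.fill_diagonal(A, False)
    p_hat = knn_density(X, k, m)
    W = A * np.sqrt(np.outer(p_hat, p_hat))
    deg = W.sum(axis=1)  # positive: every point has at least k neighbors
    L = np.eye(len(X)) - W / deg[:, None]
    return L, W
```

For data like Figure 1 (A), a 2D manifold embedded in 3 dimensions, one would pass `m=2` so that the density is estimated with respect to the intrinsic rather than the ambient dimension.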
4 Experiments

To illustrate the theory, we show how to correct the bad behavior of the kNN Laplacian for a synthetic data set. We also show how our analysis can predict the surprising behavior of LLE.

kNN Laplacian. We consider a non-linear embedding example which almost all non-linear embedding techniques handle well but on which the kNN graph Laplacian performs poorly. Figure 1 shows a 2D manifold embedded in 3 dimensions and embeddings using different graph constructions. The theoretical limit of the normalized Laplacian $L_{knn}$ for a kNN graph is $L_{knn} = \frac{1}{p} \Delta_1$, while the limit for a graph with Gaussian weights is $L_{gauss} = \Delta_p$. The first 2 coordinates of each point are from a truncated standard normal distribution, so the density at the boundary is small and the effect of the $1/p$ term is substantial. This yields the bad behavior shown in Figure 1 (C). We may use the relationship between the $k$-th nearest neighbor and the density in Eqn (15) to obtain a pilot estimate $\hat{p}$ of the density. Choosing $w_x(y) = \sqrt{ \hat{p}_n(x) \hat{p}_n(y) }$ gives a weighted kNN graph with the same limit as the graph with Gaussian weights. Figure 1 (D) shows that this change yields roughly the desired behavior but with fewer "holes" in low density regions and more in high density regions.

[Figure 2 panels: (A) Toroidal helix; (B) Laplacian; (C) LLE w/ regularization 1e-3; (D) LLE w/ regularization 1e-9]

Figure 2: (A) shows a 1D manifold isometric to a circle. (B-D) show the embeddings using (B) Laplacian eigenmaps, which correctly identifies the structure, (C) LLE with regularization 1e-3, and (D) LLE with regularization 1e-6.

LLE. We consider another synthetic data set, the toroidal helix, in which the manifold structure is easy to recover.
Figure 2 (A) shows the manifold, which is clearly isometric to a circle, a fact picked up by the kNN Laplacian in Figure 2 (B). Our theory predicts that the heuristic argument that LLE behaves like the Laplace-Beltrami operator will not hold. Since the total dimension for the drift and diffusion terms is 2 and the global coordinates also have dimension 2, there is forced cancellation of the first and second order differential terms, and the operator should behave like the 0 operator or include higher order differentials. In Figure 2 (C) and (D), we see that LLE performs poorly and that the behavior comes closer to the 0 operator when the regularization term is smaller.

5 Remarks and Discussion

5.1 Non-shrinking neighborhoods

In this paper, we have presented convergence results using results for diffusion processes without jumps. Graphs constructed using a fixed, non-shrinking bandwidth do not fit within this framework, but approximation theorems for diffusion processes with jumps still apply (see Jacod & Širjaev (2003)). Instead of being characterized by the drift and diffusion pair $\mu(x), \sigma(x) \sigma(x)^T$, the infinitesimal generator for a diffusion process with jumps is characterized by the "Lévy-Khintchine" triplet consisting of the drift, diffusion, and "Lévy measure." Given a sequence of transition kernels $K_n$, the additional requirement for convergence of the limiting process is the existence of a limiting transition kernel $K$ such that $\int K_n(\cdot, dy) \, g(y) \to \int K(\cdot, dy) \, g(y)$ locally uniformly for all $C^1$ functions $g$. This establishes an impossibility result: no method that only assigns positive mass on shrinking neighborhoods can have the same graph Laplacian limit as a kernel construction method where the bandwidth is fixed.
5.2 Convergence rates

We note that one missing element in our analysis is the derivation of convergence rates. For the main theorem, we note that it is, in fact, not necessary to apply a diffusion approximation theorem. Since our theorem still uses a kernel (albeit one with much weaker conditions), a virtually identical proof can be obtained by applying a function $f$ and Taylor expanding it. Thus, we believe that convergence rates similar to those of Hein et al. (2007) can be obtained. Also, while our convergence result is stated for the strong operator topology, the same conditions as in Hein give weak convergence.

5.3 Relation to density estimation

The connection between kernel density estimation and graph Laplacians is obvious: any kernel density estimation method using a non-negative kernel induces a random walk graph Laplacian and vice versa.

In this paper, as a consequence of identifying the asymptotic degree term, we have shown consistency of a wide class of adaptive kernel density estimates on a manifold. We also have shown that on compact sets, the bias term is uniformly bounded by a term of order $h^2$, and a small modification to the Bernstein bound (Eqn 10) gives that the variance is bounded by a term of order $h^{-m}$. Both are what one would expect. This generalizes previous work on manifold density estimation by Pelletier (2005) and Ozakin (2009) to adaptive kernel density estimation. The well-studied field of kernel density estimation may also lead to insights on how to choose a good location dependent bandwidth.

We compare the form of our density estimates to other well-known adaptive kernel density estimation techniques.
The balloon estimator and sample smoothing estimator, as described by Terrell & Scott (1992), are respectively given by

\[ \hat{f}_1(x) = \frac{1}{n h(x)^d} \sum_i K\left( \frac{\| x_i - x \|}{h(x)} \right) \tag{16} \]

\[ \hat{f}_2(x) = \frac{1}{n} \sum_i \frac{1}{h(x_i)^d} K\left( \frac{\| x_i - x \|}{h(x_i)} \right) . \tag{17} \]

In the univariate case, Terrell & Scott (1992) show that the balloon estimators yield no improvement to the asymptotic rate of convergence over fixed bandwidth density estimates. The sample smoothing estimator gives a density estimate which does not necessarily integrate to 1. However, it can exhibit better asymptotic behavior in some cases. The Abramson square root law estimator (Abramson, 1982) is an example of a sample smoothing estimator and takes $h(x_i) = h \, p(x_i)^{-1/2}$. On compact intervals, this estimator has bias of order $h^4$ rather than the usual $h^2$ (Silverman, 1998), and it achieves this bias reduction without resorting to higher order kernels, which are necessarily negative in some region. However, the bias in the tail for univariate Gaussian data is of order $(h / \log h)^2$ (Terrell & Scott, 1992), which is only marginally better than $h^2$.

We do not claim to be able to reduce bias in the case of density estimation on a manifold; in fact, we do not believe bias reduction to the order of $h^4$ is possible unless one makes some use of manifold curvature information. Nevertheless, the existing density estimation literature suggests what potential benefits one may achieve over different regions of a density.

5.4 Eigenvalues/Eigenvectors

Fixed bandwidth case. We find our location dependent bandwidth results to be of interest in the context of the negative result in von Luxburg et al. (2008) for unnormalized Laplacians with a fixed bandwidth.
Their results state that for unnormalized graph Laplacians, the eigenvectors of the discrete approximations do not converge if the corresponding eigenvalues lie in the range of the asymptotic degree operator $d(x)$, whereas for the normalized Laplacian, the "degree operator" is the identity and the eigenvectors converge if the corresponding eigenvalues stay away from 1. Our results suggest that even with unnormalized Laplacians, one can obtain convergence of the eigenvectors by manipulating the range of the degree operator through the use of a location dependent bandwidth function. For example, with kNN graphs we have that the degree operator is essentially 1. For self-tuning graphs, the degree operator also converges to 1, and since the kernels form an equicontinuous family of functions, the theory for compact integral operators may be rigorously applied when the bandwidth scaling is fixed.

Thus we can obtain unnormalized and normalized graph Laplacians that (1) have spectra that converge for fixed (non-decreasing) bandwidth scalings and (2) converge to a limit that is different from that of previously analyzed normalized Laplacians when the bandwidth decreases to 0.

Corollary 6. Assume the standard assumptions. Further assume that for some $h_0 > 0$, $\left\{ K_0\left( \frac{\| y - x \|}{h} \right) : h > h_0 \right\}$ form an equicontinuous family of functions. Let $q, g \in C^2(\mathcal{M})$ be bounded away from 0 and $\infty$. Set

\[ \gamma = \sqrt{ \frac{q}{p g} } , \qquad r_x(y) = \sqrt{ \gamma(x) \gamma(y) } \tag{18} \]

\[ \omega = \left( \frac{p g}{q} \right)^{m/2} \frac{g}{p} , \qquad w_x(y) = \sqrt{ \omega(x) \omega(y) } . \tag{19} \]

If $h_n = h_1$ for all $n$, then the eigenvectors of the normalized Laplacians converge in the sense given in von Luxburg et al. (2008).
If $h_n \downarrow 0$ satisfy the assumptions of Theorem 3, then the limit rescaled degree operator is $d = g$ and

\[ -c_n L_{norm} f \to g^{-1/2} \frac{q}{p} \Delta_q ( g^{-1/2} f ) \tag{20} \]

which induces the smoothness functional

\[ \left\langle f, \; g^{-1/2} \frac{q}{p} \Delta_q ( g^{-1/2} f ) \right\rangle_{L^2(p)} = \left\langle \nabla ( g^{-1/2} f ), \; \nabla ( g^{-1/2} f ) \right\rangle_{L^2(q)} . \tag{21} \]

Proof. Assume the $h_n \downarrow 0$ case. Use Corollary 5 and solve for $\omega$ and $\gamma$ in the system of equations $q = p^2 \omega \gamma^{m+2}$, $g = p \omega \gamma^m$. In the $h_n = h_1$ case, the conditions satisfy those given in von Luxburg et al. (2008) with the modification that the kernel is not bounded away from 0 and the additional assumption that $p$ is bounded away from 0. Thus, the asymptotic degree operator $d$ is bounded away from 0, and the proofs in von Luxburg et al. (2008) remain unchanged.

We note that the restriction to an equicontinuous family of kernel functions excludes kNN graph constructions. However, one may get around this by considering the two-step transition kernels $K_2(x, y) = K(x, \cdot) * K(\cdot, y)$, where $*$ denotes the convolution operator with respect to the underlying density. For indicator kernels like those used in kNN graph constructions, $K_2$ will be Lipschitz and hence form an equicontinuous family. Thus, if one handles the potential issues with the random bandwidth function, one may apply the theory of compact integral operators to obtain convergence of the spectrum and eigenvectors for kNN graph Laplacians when $k$ grows appropriately.

5.5 Reasons for choosing a graph construction method

We highlight how our more general kernel can yield advantageous properties.
In particular, it yields graph constructions where one can (1) control the sparsity of the Laplacian matrix, (2) control connectivity properties in low density regions, (3) give asymptotic limits that cannot be attained using previous graph construction methods, and (4) give Laplacians with good spectral properties in the non-shrinking bandwidth case.

One way to control (1) and (2) is to make the binary choice of using kNN or a kernel with uniform bandwidth to construct the graph. Our results show that, by using a pilot estimate of the density, one can obtain sparsity and connectivity properties in the continuum between these two choices.

For (3) and (4), we note that the limits for previously analyzed unnormalized Laplacians were of the form $p^{\alpha - 1} \Delta_{p^\alpha} f$. Using Corollary 5, one sees that limits of the form $\frac{q}{p} \Delta_q$ for any smooth, bounded density $q$ on the manifold can be obtained. Equivalently, one can approximate the smoothness functional $\| \nabla f \|^2_{L^2(q)}$ for almost any $q$, not just $p^\alpha$.

For normalized Laplacians, which have good spectral properties, the previously known limits induced smoothness functionals of the form $\| \nabla ( p^{(1 - \alpha)/2} f ) \|^2_{L^2(p^\alpha)}$. With our more general kernel and any $g, q \in C^2(\mathcal{M})$, we may induce a smoothness functional of the form $\| \nabla (g f) \|^2_{L^2(q)}$. In particular, in the interesting case where $g = 1$ and the smoothness functional is just a norm on the gradient of $f$, i.e. $\| \nabla f \|^2_{L^2(q)}$, $q$ may be chosen to be almost any density, not just $q = p^1$.

6 Conclusions

We have introduced a general framework that enables us to analyze a wide class of graph Laplacian constructions. Our framework reduces the problem of graph Laplacian analysis to the calculation of a mean and variance (or drift and diffusion) for any graph construction method with positive weights and shrinking neighborhoods.
Our main theorem extends existing strong operator convergence results to non-smooth kernels, and introduces a general location-dependent bandwidth function. The analysis of a location-dependent bandwidth function, in particular, significantly extends the family of graph constructions for which an asymptotic limit is known. This family includes the previously unstudied (but commonly used) kNN graph constructions, unweighted r-neighborhood graphs, and "self-tuning" graphs.

Our results also have practical significance in graph constructions as they suggest graph constructions that (1) can produce sparser graphs than those constructed with the usual kernel methods, despite having the same asymptotic limit, and (2) in the fixed bandwidth regime, produce normalized Laplacians that have well-behaved spectra but converge to a different class of limit operators than previously studied normalized Laplacians. In particular, this class of limits includes those that induce the smoothness functional $\| \nabla f \|^2_{L^2(q)}$ for almost any density $q$. The graph constructions may also (3) have better connectivity properties in low-density regions.

7 Acknowledgements

We would like to thank Martin Wainwright and Bin Yu for their helpful comments, and our anonymous reviewers for ICML 2010 for the detailed and helpful review.

References

Abramson, I.S. On bandwidth variation in kernel estimates - a square root law. The Annals of Statistics, 10(4):1217–1223, 1982.
Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
Belkin, M. and Niyogi, P. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56:209–239, 2004.
Belkin, M. and Niyogi, P. Towards a theoretical foundation for Laplacian-based manifold methods. COLT, 2005.
Belkin, M. and Niyogi, P.
Convergence of Laplacian eigenmaps. In NIPS 19, 2006.
Boothby, W. M. An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press, 1986.
Bousquet, O., Chapelle, O., and Hein, M. Measure based regularization. In NIPS 16, 2003.
Devroye, L.P. and Wagner, T.J. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics, pp. 536–540, 1977.
Ethier, S. and Kurtz, T. Markov Processes: Characterization and Convergence. Wiley, 1986.
Giné, E. and Koltchinskii, V. Empirical graph Laplacian approximation of Laplace-Beltrami operators: large sample results. In 4th International Conference on High Dimensional Probability, 2005.
Grigor'yan, A. Heat kernels on weighted manifolds and applications. Cont. Math, 398:93–191, 2006.
Hein, M., Audibert, J.-Y., and von Luxburg, U. Graph Laplacians and their convergence on random neighborhood graphs. JMLR, 8:1325–1370, 2007.
Hein, M., Audibert, J.-Y., and von Luxburg, U. From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. In COLT, 2005.
Jacod, J. and Širjaev, A. N. Limit Theorems for Stochastic Processes. Springer, 2003.
Kallenberg, O. Foundations of Modern Probability. Springer Verlag, 2002.
Kannan, R., Vempala, S., and Vetta, A. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497–515, 2004.
Lafon, S. Diffusion Maps and Geometric Harmonics. PhD thesis, Yale University, CT, 2004.
Loftsgaarden, D.O. and Quesenberry, C.P. A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics, 36(3):1049–1051, 1965.
Maier, M., von Luxburg, U., and Hein, M. Influence of graph construction on graph-based clustering measures. In NIPS 21, 2008.
Nadler, B., Lafon, S., Coifman, R., and Kevrekidis, I.
Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. In Applied and Computational Harmonic Analysis, 2006.
Ozakin, A. Submanifold density estimation. In Advances in Neural Information Processing Systems 22 (NIPS), 2009.
Pelletier, Bruno. Kernel density estimation on Riemannian manifolds. Statistics and Probability Letters, 73(3):297–304, 2005. ISSN 0167-7152.
Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.
Silverman, B.W. Density estimation for statistics and data analysis. Chapman & Hall/CRC, 1998.
Singer, A. From graph to manifold Laplacian: the convergence rate. Applied and Computational Harmonic Analysis, 21(1):128–134, 2006.
Terrell, G.R. and Scott, D.W. Variable kernel density estimation. The Annals of Statistics, 20(3):1236–1265, 1992.
von Luxburg, U., Belkin, M., and Bousquet, O. Consistency of spectral clustering. Annals of Statistics, 36(2):555–586, 2008.
Zelnik-Manor, L. and Perona, P. Self-tuning spectral clustering. In NIPS 17, 2004.
Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.

8 Appendix

8.1 Main lemma

Lemma 7 (Integration with location dependent bandwidth). Let $\mathbf{1}$ be the indicator function and $h > 0$ be a constant. Let $r_x$ be a location dependent bandwidth function that satisfies the standard assumptions, i.e. it has a Taylor-like expansion $\tilde{r}_x(y) = r_x(x) + ( \dot{r}_x(x) + \alpha_x \, \mathrm{sign}(u_x^T s) \, u_x )^T s + \epsilon_r(x, s)$. Let $V_m = \frac{\pi^{m/2}}{\Gamma\left( \frac{m}{2} + 1 \right)}$ be the volume of the unit m-sphere.
Then

\[ M_0 = \frac{1}{V_m h^m} \int \mathbf{1}\left( \frac{\| y - x \|}{\tilde{r}_x(s)} < h \right) ds = r_x(x)^m + h^2 \epsilon_0(x, h) \]

\[ M_1 = \frac{1}{V_m h^m} \int s \, \mathbf{1}\left( \frac{\| y - x \|}{\tilde{r}_x(s)} < h \right) ds = h^2 r_x(x)^{m+2} \dot{r}(x) + h^3 \epsilon_1(x, h) \]

\[ M_2 = \frac{1}{V_m h^m} \int s s^T \, \mathbf{1}\left( \frac{\| y - x \|}{\tilde{r}_x(s)} < h \right) ds = \frac{2 h^2}{m + 2} r_x(x)^{m+2} I + h^3 \epsilon_2(x, h) \]

where $\sup_{x \in \mathcal{M}, h}$ [...] $0$.

Proof. Let $v(s) = \dot{r}(x) + \mathrm{sign}(s^T u_x) \, \alpha u_x$. We will show that the set on which the indicator function is nonzero is approximately a sphere shifted by $v / r_x(x)$ with radius $h r_x(x)$.

\[ \mathbf{1}\left( \frac{\| y - x \|}{r_x(s)} < h \right) = \mathbf{1}\left( \| s \|^2 + \| L(s s^T) \|^2 < h^2 \left( r_x(x) + v(s)^T s + O(\| s \|^2) \right)^2 \right) \]
\[ = \mathbf{1}\left( \| s \|^2 < h^2 r_x(x)^2 \left( 1 + 2 v(s)^T s + O(h^2) \right) \right) \]
\[ = \mathbf{1}\left( \| s \|^2 - 2 h^2 \frac{v(s)^T s}{r_x(x)} + h^4 \frac{v(s)^T v(s)}{r_x(x)^2} < h^2 r_x(x)^2 + O(h^4) \right) \]
\[ = \mathbf{1}\left( \left\| s - \frac{v(s)}{r_x(x)} \right\| < h r_x(x) + h^3 \delta_x(s) \right) \]

for some function $\delta_x(s)$. Furthermore, the assumptions on the bounded curvature of the manifold and uniform bounds on the bandwidth function remainder term $\epsilon_r(x, s)$ give that the perturbation term $\delta_x(s)$ may be uniformly bounded by $\sup_{x \in \mathcal{M}} | \delta_x(s) | \leq C_\delta \| s \|^2$ for some constant $C_\delta$. The result for the zeroth moment follows immediately from this. The results for the first and second moments are calculated in Lemma 10.

8.1.1 Refined analysis of the zeroth moment

For convergence of the normalized Laplacian, we need a more refined result for the zeroth moment.

Lemma 8. Assume

\[ \tilde{r}_x(y) = r_x(s) + \epsilon_r(x, s) , \]

where $r_x(s)$ is twice continuously differentiable as a function of $x$ and $s$ and $\epsilon_r$ is bounded. Then

\[ \int \frac{1}{V_m h^m} \mathbf{1}\left( \frac{\| y - x \|}{\tilde{r}_x(s)} < h \right) ds = r_x(x)^m + h^2 b(x) + h^2 \epsilon_0(x, h) \]

where $b$ is continuous and $\sup_x | \epsilon_0(x, h) | \to 0$ as $h \to 0$.

Proof. We first sketch the idea behind the proof and leave the details to interested readers. One may convert the integral in normal coordinates to an integral in polar coordinates $(R, \theta)$.
One may then apply the implicit function theorem to obtain that the unperturbed radius function $R$ is a twice continuously differentiable function of $h$. This gives a Taylor expansion of the zeroth moment with respect to $h$. Bounding the contribution of the error term $\epsilon_r(x, s)$ then gives the desired result.

We may express the integral for the zeroth moment in polar coordinates,

\[ Z_x(h) = \int \frac{1}{V_m h^m} \mathbf{1}\left( \frac{\| y - x \|}{\tilde{r}_x(s)} < h \right) ds = \int R_x(\theta, h) \, d\mu_\theta \]

where $\mu_\theta$ is the uniform measure on the surface of the unit m-sphere and $\tilde{s} = s / h = R_x(\theta, h) \, \theta$ solves the equation

\[ \| \tilde{s} \|^2 + L(\tilde{s} \tilde{s}^T) = \left( r_x(x) + h \nabla r_x(x)^T \tilde{s} + h^2 \tilde{s}^T H_{r_x}(0) \tilde{s} \right)^2 , \]

and $H_{r_x}(0)$ is the Hessian of $r_x(\cdot)$ evaluated at 0. By the implicit function theorem, the solutions $\tilde{s}$ define a twice continuously differentiable function of $x, h$. For sufficiently small $h \geq 0$, $\tilde{s}$ is bounded away from 0 since $r_x$ is bounded away from 0, and $\| s / h \|$ is bounded away from $\infty$ by the bound in Lemma 7. Thus, $R_x(\theta, h)$ and $Z_x(h)$ are twice continuously differentiable with bounded second derivatives. $Z_x(h)$ then has a second-order Taylor expansion $Z_x(h) = Z_x(0) + Z'_x(0) h + Z''_x(0) h^2 + o(h^2)$. By the less refined analysis in Lemma 7, we have that $Z_x(0) = r_x(x)^m$ and $Z'_x(0^+) = 0$. One may apply a squeeze theorem to obtain that the contribution of the error term $\epsilon_r(x, s)$ to the zeroth moment is bounded by $C_r \sup_{x, s} | \epsilon_r(x, s) |$ for some constant $C_r$, and the result follows.

8.2 Moments of the indicator kernel / Integrating over the centered sphere in normal coordinates

Here we calculate the first three moments of the normalized indicator kernel, where $V_m = \int \mathbf{1}(\| u \| < 1) \, du = \int_{S_m} du$ is the volume of the m-dimensional unit sphere in Euclidean space.

Lemma 9 (Moments for the sphere). Let $K(\| s \| / h) = \frac{1}{h^m V_m} \mathbf{1}(\| s \| < h)$.
Then the first two moments are given by:

\[ M_0 = \int K(\| s \| / h) \, ds = \frac{1}{h^m V_m} \int_{S_m} ds = 1 + O(h^3) \]

\[ M_1 = \int s K(\| s \| / h) \, ds = \frac{1}{h^m V_m} \int_{S_m} s \, ds = 0 + O(h^4) \]

\[ M_2 = \int s s^T K(\| s \| / h) \, ds = \frac{1}{h^m V_m} \int_{S_m} s s^T \, ds = \frac{1}{m + 2} I + O(h^4) . \]

Proof. The error terms $O(h^i)$ arise trivially after converting normal coordinates to tangent space coordinates. Thus, we may simply treat the integrals as integrals in m-dimensional Euclidean space to obtain the leading term. The values for $M_0$ and $M_1$ follow immediately from the definition of the volume $V_m$ and by symmetry of the sphere. We obtain the second moment result by calculating the values on the diagonal and off-diagonal. On the off-diagonal,

\[ \frac{1}{V_m} \int_{S_m} s_i s_j \, ds = 0 \]

for $i \neq j$ due to symmetry of the sphere. On the diagonal,

\[ \frac{1}{V_m} \int_{S_m} s_i^2 \, ds = \frac{V_{m-1}}{V_m} \int_{-1}^{1} s_i^2 (1 - s_i^2)^{(m-1)/2} \, ds_i \tag{22} \]
\[ = \frac{V_{m-1}}{V_m} \int_{-1}^{1} s_i \times s_i (1 - s_i^2)^{(m-1)/2} \, ds_i \tag{23} \]
\[ = 0 + \frac{V_{m-1}}{V_m} \int_{-1}^{1} \frac{1}{m + 1} (1 - s_i^2)^{(m+1)/2} \, ds_i \tag{24} \]
\[ = \frac{1}{m + 1} \frac{V_{m-1}}{V_m V_{m+1}} \int_{-1}^{1} V_{m+1} (1 - s_i^2)^{(m+1)/2} \, ds_i \tag{25} \]
\[ = \frac{1}{m + 1} \frac{V_{m-1}}{V_{m+1}} \frac{V_{m+2}}{V_m} \tag{26} \]
\[ = \frac{1}{m + 2} \tag{27} \]

where the last equality uses the recurrence relationship $V_{m+2} = \frac{2 \pi}{m + 2} V_m$.

8.3 Integrating the shifted and perturbed sphere

Here we calculate the moments used in Lemma 7. The integrals in Lemma 7 essentially involve integrating over a sphere with (1) a shifted center $h^2 \dot{r}_x(x)$, (2) a symmetric shift by $\mathrm{sign}(s^T u) \, h^2 \alpha_x u$ on two half-spheres, and (3) a small perturbation $h^3 \delta_x(s)$.

Lemma 10 (Moments of the shifted and perturbed sphere). Let $v_c \in \mathbb{R}^m$, $u$ be a unit vector in $\mathbb{R}^m$, $\beta \in \mathbb{R}$, and $h > 0$. Define $\tilde{K}(s) = \mathbf{1}\left( \| s - v_c + \mathrm{sign}(s^T u) \beta u \| < h + h^3 \delta \right)$, so that the support of $\tilde{K}$ is a shifted and perturbed sphere with center $v_c$, symmetric shift $\mathrm{sign}(s^T u) \beta u$, and radius perturbation $h^3 \delta$.
Assume $\| v_c \|, | \beta | < C h^2$ and $\delta < \min\{ C, 1 \}$ for some constant $C$, and put $h_{\max} = h + h^3 \delta$. Then

\[ M_0 = \frac{1}{V_m} \int_{\mathbb{R}^m} \tilde{K}(s) \, ds = h^m + \epsilon_0 \]

\[ M_1 = \frac{1}{V_m} \int_{\mathbb{R}^m} s \tilde{K}(s) \, ds = h^{m+2} v_c + \epsilon_1 \]

\[ M_2 = \frac{1}{V_m} \int_{\mathbb{R}^m} s s^T \tilde{K}(s) \, ds = \frac{h^{m+2}}{m + 2} I + \epsilon_2 , \]

where $\epsilon_0 < \kappa C h_{\max}^{m+1}$ and $\epsilon_i < \kappa C h_{\max}^{m+3}$ for $i = 1, 2$, and $\kappa$ is some universal constant that does not depend on $\delta$, $v_c$, or $\beta$.

Proof. Set $H_+ = \{ s \in \mathbb{R}^m : u^T s > 0 \}$ and $H_- = H_+^C$ to be the half-spaces defined by $u$. For a set $H \subset \mathbb{R}^m$, let $H + v_c := \{ w + v_c : w \in H \}$. We first bound the error introduced by the perturbation $h^3 \delta$. Define

\[ \tilde{A} := \mathrm{supp}(\tilde{K}) = \{ s \in \mathbb{R}^m : \| s - v_c + \mathrm{sign}(s^T u) \beta u \| < h + h^3 \delta \} \]
\[ A := \{ s \in \mathbb{R}^m : \| s - v_c + \mathrm{sign}(s^T u) \beta u \| < h \} \]

so that $A$ gets rid of the dependence on the perturbation. For any function $Q$, we have a trivial bound

\[ \left| \int_{\tilde{A}} Q(s) \, ds - \int_{A} Q(s) \, ds \right| < Q_{\max} \left| \mathrm{Vol}(\tilde{A}) - \mathrm{Vol}(A) \right| < Q_{\max} V_m \left| h_{\max}^m - h^m \right| < Q_{\max} V_m ( m h_{\max}^{m-1} ) ( h^3 \delta ) = O( h^{m+2} Q_{\max} ) \tag{28} \]

where $Q_{\max} = \sup_{\| s \|}$ [...]