Local Support Vector Machines: Formulation and Analysis


Authors: Ravi Ganti, Alexander Gray

Local Support Vector Machines: Formulation and Analysis

Ravi Ganti, Alexander Gray
School of Computational Science & Engineering, Georgia Tech
gmravi2003@gatech.edu, agray@cc.gatech.edu

Abstract

We provide a formulation for Local Support Vector Machines (LSVMs) that generalizes previous formulations and brings out the explicit connections to the local polynomial learning used in the nonparametric estimation literature. We investigate the simplest type of LSVMs, called Local Linear Support Vector Machines (LLSVMs). For the first time we establish conditions under which LLSVMs make Bayes consistent predictions at each test point $x_0$. We also establish rates at which the local risk of LLSVMs converges to the minimum value of the expected local risk at each point $x_0$. Using stability arguments we establish generalization error bounds for LLSVMs.

1 Introduction

We consider the problem of binary classification, where we are given a sample $S$ of $n$ i.i.d. points, $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \in (\mathcal{X} \times \{-1, +1\})^n$, $\mathcal{X} \subset \mathbb{R}^d$, and we are required to learn a classifier $g_n : \mathcal{X} \to \{-1, +1\}$ using $S$. Binary classification is a well studied problem in machine learning [1, 2]. One of the simplest classification algorithms is the $k$-nearest neighbour (kNN) algorithm, which takes a majority vote over the $k$ nearest neighbours of a test point $x_0$ in order to determine the label of $x_0$. The kNN algorithm, among others, belongs to the class of local learning algorithms, which take into account only the local information around the test point $x_0$ when deciding the label of $x_0$. Cortes and Vapnik [3] introduced the celebrated support vector machine (SVM), which learns a decision boundary by maximizing the margin. Zhang et al. [4] and Blanzieri et al. [5] independently proposed a classification algorithm called Local Support Vector Machines (LSVMs) and applied it to remote sensing and visual recognition tasks respectively.
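The kNN rule just described can be sketched in a few lines. The sketch below is our own illustration (the toy data and function names are not from the paper):

```python
import numpy as np

def knn_predict(X, y, x0, k):
    """Majority vote over the k nearest neighbours of x0; labels are in {-1, +1}."""
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to the test point
    nearest = np.argsort(dists)[:k]          # indices of the k nearest training points
    return 1 if y[nearest].sum() >= 0 else -1

# toy data: a negative cluster near the origin, a positive cluster near (3, 3)
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
y = np.array([-1, -1, 1, 1, 1])
print(knn_predict(X, y, np.array([2.8, 3.0]), k=3))   # → 1
```

The prediction depends only on the $k$ points nearest to $x_0$; this is the locality that LSVMs combine with large margin classification below.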
LSVMs exploit locality, like kNN, along with the idea of large margin classification, to learn a global non-linear classifier, by learning an SVM locally at each test point. By using a large margin approach, LSVMs inherit the large margin classifier's robustness to data perturbation. By utilizing only local information, LSVMs avoid relying on the global geometry of the distribution. This is a good strategy when our data lies on a manifold, where the geodesic distance between close points is approximately Euclidean, but the same is not true for points far away. A prime example is the task of image recognition. Segata et al. [6] compared their implementation of approximate LSVM with a standard RBF SVM in LIBSVM on the 2-spirals dataset, a 2-dimensional dataset where the data lives on a manifold. The accuracy of LSVM was 88.47%, whereas that of SVM was only 85.29%, and that of kNN was 88.43%. Another scenario where local learning is more beneficial than global learning is when data is multimodal and/or heterogeneous. For example, suppose that given census data we are required to classify people as belonging to the high income group (HIG) or the middle income group (MIG). A global HIG vs. MIG classifier might be hard to build, since the notion of HIG/MIG changes with states/counties. It is better to utilize local information, such as information from a particular county, to build multiple local classifiers; our global classifier is then a collection of many such local classifiers. The problem of webspam detection is another example where data is heterogeneous. A page might be webspam for one category but may not be for another. In such cases, in order to categorize a webpage as spam or not, it is easier to build multiple local classifiers rather than a single global classifier. A numerical illustration was provided by Cheng et al.
[7]: on a one-vs-all classification problem on the Covtype dataset, their implementation of LSVM registered an accuracy of about 90%, whereas the accuracy of RBF SVM was only 86.21% and that of kNN was 67.40%. Since a one-vs-all classification problem makes the dataset multimodal and heterogeneous (due to the grouping of multiple classes into one single class), the superior performance of LSVMs over SVMs illustrates the power of local learning in such settings. A practical advantage of LSVMs is that they can exploit fast algorithms for range search [8], and various other approximation techniques [6], along with parallel computing architectures, in order to learn on large datasets. SVMs are well understood both theoretically and practically [2, 3]. However, no theoretical understanding yet exists for LSVMs. The empirical success of LSVMs begs a theoretical understanding of such techniques. Our work is the first attempt to provide a theoretical understanding of LSVMs. Our contributions are as follows:
• We provide a formulation of LSVMs which generalizes previous formulations due to Blanzieri et al. [5] and Zhang et al. [4], and provide the first statistical analysis of locally linear support vector machines (LLSVMs), the simplest kind of LSVMs. Our formulation (Section (2)) is similar to that of Cheng et al. [7] but is more directly motivated by local polynomial regression; hence the role of the smoothing kernel in our formulation is much cleaner than their use of an unspecified weight function. Our formulation makes explicit the direct connections to local polynomial fitting via the use of polynomial Mercer kernels. This allows us to view LSVMs as approximating the decision boundary locally using smooth functions, a novel interpretation.
• In Theorem (1) (Section (3)) we provide sufficient conditions which guarantee that, for any given point $x_0$, the prediction of LLSVMs at $x_0$ matches that of the Bayes classifier, establishing pointwise consistency.
These are conditions on the distribution and the model parameters $\lambda, \sigma$.
• The LLSVM problem at any point $x_0$ minimizes the sample version of the stochastic objective $\min_w \frac{\lambda}{2}\|w\|^2 + \mathbb{E}[L(y\langle w, x\rangle) K(x, x_0, \sigma)]$. In Theorem (8) (Section (4)) we provide a high probability bound of $\tilde{O}\big(\frac{1}{\sqrt{n}\,\lambda\sigma^{2d}}\big)$ on the difference between the smallest value of this stochastic objective and its value at the solution obtained by solving the LLSVM optimization problem. This result tells us how quickly the stochastic objective, at the solution of the LLSVM problem, converges to its true minimum value.
• Define the $L$ risk of any function $f$ as $\mathbb{E}\,L(yf(x))$. In Theorem (9) (Section (4)) we establish, via uniform stability bounds, an upper bound on the gap between the $L$ risk of the global function learnt by LLSVMs, with bandwidth $\sigma$ and regularization $\lambda$, and the empirical $L$ risk of LLSVMs. This gap decays as $O\big(\frac{1}{\sqrt{n\lambda\sigma^{d}}}\big)$.
Notice that while Theorems (1), (8) are pointwise results, Theorem (9) involves the global classifier learnt by solving the LLSVM problem at each training point $x_i$ with parameters $\lambda, \sigma$; hence it is a "global" result. Theorems (1), (9) suggest that LLSVMs should work well in low dimensions, or if the data lies on a low-dimensional manifold. This justifies the empirical findings of Segata et al. and Cheng et al.

2 Formulation of LLSVMs and LSVMs

Our formulation of LLSVMs is directly motivated by a technique in nonparametric regression called local linear regression (LLR) [9]. In LLR one fits a non-linear regression function by fitting a linear function locally at each point. The idea of a local linear fit is inspired by the fact that any differentiable function can be well approximated locally by linear functions. Hence LLR locally approximates the underlying regression function with a linear function. LLSVMs adopt a similar approach by making local linear fits to the underlying non-linear decision boundary.
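The LLR procedure described above is just kernel-weighted least squares, solved independently at each evaluation point. A minimal numerical sketch of this (ours, not the paper's; the Epanechnikov weights and the tiny ridge term added for numerical safety are illustrative choices):

```python
import numpy as np

def llr_predict(X, y, x0, sigma):
    """Local linear regression at x0: fit a linear function by weighted least
    squares with Epanechnikov kernel weights K(x_i, x0, sigma), evaluate it at x0."""
    Xb = np.hstack([X, np.ones((len(X), 1))])            # append 1 for the intercept
    u = np.linalg.norm(X - x0, axis=1) / sigma
    k = np.where(u <= 1.0, 0.75 * (1.0 - u ** 2) / sigma, 0.0)  # local weights
    A = Xb.T @ (k[:, None] * Xb) + 1e-9 * np.eye(Xb.shape[1])   # tiny ridge for stability
    w = np.linalg.solve(A, Xb.T @ (k * y))               # weighted normal equations
    return w[:-1] @ x0 + w[-1]

# the local linear fit tracks a smooth nonlinear function closely
X = np.linspace(0.0, 3.0, 300).reshape(-1, 1)
y = np.sin(X[:, 0])
print(llr_predict(X, y, np.array([1.5]), sigma=0.3))   # close to sin(1.5) ≈ 0.997
```

LLSVMs below replace the squared loss in this weighted fit with the hinge loss, so that the local linear function approximates the decision boundary rather than the regression function.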
In order to classify an unseen point $x_0$, LLSVMs solve the problem
$$\hat{w}^{reg}_* = \arg\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} L(y_i \langle w, x_i \rangle) K(x_i, x_0, \sigma), \qquad (1)$$
where $K(x_i, x_0, \sigma)$ is a smoothing kernel with bandwidth $\sigma$ and $L(\cdot)$ is a Lipschitz, convex upper bound to the 0-1 loss. In this paper we will be concerned with the hinge loss $L(t) \stackrel{def}{=} \max\{1 - t, 0\}$, which is used in SVMs. Some popular examples of smoothing kernels are the Epanechnikov kernel and the rectangular kernel [9]. (Note that smoothing kernels are not the same as the Mercer kernels used in SVMs; the Epanechnikov kernel is a popular example of a smoothing kernel that is not a Mercer kernel.) Replacing the term $L(y_i \langle w, x_i \rangle)$ in equation (1) with the term $L(y_i \langle w, \phi(x_i) \rangle)$, where $\phi(\cdot)$ is a kernel map induced by a Mercer kernel, we get LSVMs:
$$w_* = \arg\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} L(y_i \langle w, \phi(x_i) \rangle) K(x_i, x_0, \sigma). \qquad (2)$$
Our formulation of LSVMs, as shown in equation (2), strictly generalizes the formulations of both Blanzieri et al. and Zhang et al. Strictly speaking, their algorithms used a rectangular smoothing kernel with the bandwidth equal to the distance of the $k$th nearest neighbour of the test point $x_0$ in the training set. In comparison, our formulation uses a smoothing kernel that allows the formulation to down-weight points in a smooth fashion. The vector $\hat{w}^{reg}_*$ that is learnt by solving the optimization problem (1) is used for classification at $x_0$ only. Hence, unlike linear SVMs, LLSVMs are still non-linear, as the linear fits are only local, and the smoothing kernel precisely determines the locality at each $x_0$. To see a simple example of the influence of the smoothing kernel, consider LLSVMs with the hinge loss. Standard primal-dual calculations yield
$$\hat{w}^{reg}_* = \frac{1}{\lambda}\sum_{i=1}^{n} \alpha_i y_i K(x_i, x_0, \sigma)\, x_i, \qquad 0 \le \alpha_i \le 1/n, \qquad (3)$$
where the $\alpha_i$ are the dual variables.
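Per test point, problem (1) is an ordinary unconstrained convex problem: a hinge-loss linear classifier with per-example weights $K(x_i, x_0, \sigma)$. The following is a minimal numerical sketch, not the paper's implementation: it uses the Epanechnikov smoothing kernel and plain subgradient descent with a $1/(\lambda t)$ step size, and all function names, constants, and toy data are our own illustrative choices:

```python
import numpy as np

def epanechnikov(X, x0, sigma):
    """Finite-tailed smoothing kernel: zero outside the bandwidth sigma."""
    u = np.linalg.norm(X - x0, axis=-1) / sigma
    return np.where(u <= 1.0, 0.75 * (1.0 - u ** 2) / sigma, 0.0)

def llsvm_fit(X, y, x0, lam=0.1, sigma=0.5, iters=2000):
    """Subgradient descent on objective (1):
         (lam/2)||w||^2 + (1/n) sum_i L(y_i <w, x_i>) K(x_i, x0, sigma),
       with hinge loss L(t) = max(1 - t, 0) and a 1 appended to each x_i (bias)."""
    n = len(X)
    Xb = np.hstack([X, np.ones((n, 1))])     # x_bar = (x, 1), as in the paper's notation
    k = epanechnikov(X, x0, sigma)           # points outside the bandwidth get weight 0
    w = np.zeros(Xb.shape[1])
    for t in range(1, iters + 1):
        active = (y * (Xb @ w) < 1.0)        # hinge subgradient is nonzero only here
        grad = lam * w - Xb.T @ (active * k * y) / n
        w -= grad / (lam * t)                # 1/(lam*t) step, standard for strong convexity
    return w

def llsvm_predict(X, y, x0, **kw):
    """Sign of the local linear fit at x0; the fit is used at x0 only."""
    w = llsvm_fit(X, y, x0, **kw)
    return 1 if w[:-1] @ x0 + w[-1] >= 0 else -1
```

Because the smoothing kernel is finite-tailed, training points with $K(x_i, x_0, \sigma) = 0$ never enter the subgradient, which mirrors the role of the dual weights in equation (3).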
If one uses a finite-tailed smoothing kernel, such as the Epanechnikov kernel or the rectangular kernel, then the points $x_i$ which are outside the bandwidth of the kernel, i.e. for which $K(x_i, x_0, \sigma) = 0$, have no effect on $\hat{w}^{reg}_*$. Hence the resulting LLSVM ignores these points and tries to maximize the margin in the input space using points that are close to $x_0$. Mercer kernels, which arise out of the kernel map $\phi(\cdot)$, on the other hand have nothing to do with locality; instead they allow us to fit non-linear functions. If one uses a polynomial Mercer kernel of degree $p$ in conjunction with a smoothing kernel, then it is equivalent to making local degree-$p$ approximations to the boundary function. While such approximations are potentially more powerful than local linear approximations, one would require stronger conditions, such as the existence of higher order derivatives of the decision boundary, to justify local polynomial approximations. To avoid making such strong assumptions we focus on LLSVMs in this paper. The formulation of Cheng et al. [7] is similar to the optimization problem (2), but uses an unspecified weight function $\sigma(x_i, x_0)$ in place of $K(x_i, x_0, \sigma)$. While the importance of the smoothing kernel, and its interaction with Mercer kernels, has been distilled in our formulation, the importance and impact of the weight function in the formulation of Cheng et al. was not brought out clearly.

Related Work. Kernel based rules (KBRs) have been proposed as a nonparametric classification method (see chapter 10 in [10]) and are essentially a simplified version of LLSVMs. KBRs predict the label of a point $x_0$ as $\mathrm{sgn}\big(\sum_{i=1}^{n} y_i K(x_i, x_0, \sigma)\big)$. This can be seen as using equation (3) but with all the $\alpha_i$'s set to a constant. However, for LLSVMs these $\alpha$ values themselves depend on the training data, and hence results from the KBR literature do not transfer to our case.
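For contrast with LLSVMs, the KBR prediction rule involves no learning at all; a sketch with a rectangular kernel (the positive normalization constant does not affect the sign, so it is dropped; the toy data is ours):

```python
import numpy as np

def kbr_predict(X, y, x0, sigma):
    """Kernel based rule: sgn(sum_i y_i K(x_i, x0, sigma)).
    With a rectangular kernel this is an unweighted vote over the ball B(x0, sigma)."""
    inside = np.linalg.norm(X - x0, axis=1) <= sigma   # rectangular kernel support
    vote = (y * inside).sum()                          # equation (3) with constant alpha_i
    return 1 if vote >= 0 else -1

X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
y = np.array([-1, -1, 1, 1, 1])
print(kbr_predict(X, y, np.array([3.0, 3.0]), sigma=0.5))   # → 1
```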
Learning multiple local classifiers has also been done by first clustering the data and then learning a classifier in each of these clusters [11, 12], or by using a baseline classifier [13, 14] to find regions where the classifier commits errors and then learning a dedicated classifier for each of these erroneous regions. All these algorithms are different from LLSVMs, as they learn a finite mixture of local classifiers from the training data and classify the test point as per the appropriate mixture component. In contrast, LLSVMs learn a local classifier on demand for each test point. Ensemble methods also learn multiple classifiers and combine them to learn a global model. However, there the classification model is fixed and does not change from one test point to another.

Notation. $[n] \stackrel{def}{=} \{1, \ldots, n\}$. Let $B(x, \sigma)$ denote a $d$-dimensional ball of radius $\sigma$ centered around $x$. Also, let $\langle a, b \rangle \stackrel{def}{=} a^T b$. Throughout the paper we shall use $w \in \mathbb{R}^{d+1}$ to denote a vector learned by using the training data set with a 1 appended to each training point as the $(d+1)$th dimension. If $x \in \mathbb{R}^d$ then $\langle w, x \rangle \stackrel{def}{=} \langle w, \bar{x} \rangle$, where $\bar{x} \stackrel{def}{=} (x, 1) \in \mathbb{R}^{d+1}$. Denote by $f : \mathcal{X} \to \mathbb{R}$ an arbitrary measurable function. Finally, since most of our results are "pointwise results", we shall use $x_0$ to represent an arbitrary point, and all "local quantities" will be defined w.r.t. $x_0$.

3 Pointwise Consistency of LLSVMs

We now state the assumptions and our first main result.
• A0: The domain $\mathcal{X} \subset \mathbb{R}^d$ is compact, $\|x\|_2 \le M$ for all $x \in \mathcal{X}$, and the marginal distribution on $\mathcal{X}$ is absolutely continuous w.r.t. the Lebesgue measure.
• A1: Let $C^1$ denote the class of functions that are at least once differentiable on $\mathcal{X}$. We assume that $\eta(x) := P[y = 1 \,|\, x] \in C^1$, and as a result $f_B(x) \stackrel{def}{=} 2\eta(x) - 1 \in C^1$. Such smoothness assumptions (and stronger ones) have been used to study minimax rates for classification in [15].
The impact of A1 is twofold. Firstly, the minimizer of the $L$ risk is a function of $\eta$; for the hinge loss this function is $\mathrm{sgn}(2\eta(x) - 1)$. The same holds true even for "local" versions of the $L$ risk and the 0-1 risk. Since $\eta \in C^1$, one can invoke continuity arguments to guarantee a small enough radius $\sigma$ within which the minimizer of the $L$ risk is a smooth function. Hence one can restrict the search for an optimal function to $C^1$. Such local quantities are defined in Section (3.2).
• A2: $K(\cdot, \cdot, \cdot)$ is a finite-tailed smoothing kernel that satisfies $K(\cdot, \cdot, \cdot) \ge 0$ (positive kernel), $\int_{x_2 \in \mathbb{R}^d} K(x_2, x_1, \sigma)\, dx_2 = 1$ for all $x_1 \in \mathcal{X}$, vanishes for all $x \notin B(x_1, \sigma)$, and $K(x_1, x_2, \sigma) \le K_m = \Theta(\frac{1}{\sigma^d})$ for all $x_1, x_2 \in \mathbb{R}^d$. The assumptions in A2 are standard in the nonparametric estimation literature [9]. The finite-tail assumption on the smoothing kernel simplifies proofs and should be easy to relax.
• A3: For all $x_0 \in \mathcal{X}$, $\lim_{\sigma \to 0} \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)] = 0$. A3 allows us, in the limit, to approximate the minimum local $L$ risk using only linear functions, and therefore allows us to model non-linear decision boundaries via locally linear fits.
• A4: Let $H_\sigma$ be the region of intersection of a halfspace and $B(x_0, \sigma)$ such that $\frac{Vol(H_\sigma)}{Vol(B(x_0, \sigma))} \ge \frac{1}{2}$. Then a.s. w.r.t. $\mathcal{D}_X$, $\lim_{\sigma \to 0} \inf_{H_\sigma} \mathbb{E}_{x \sim \mathcal{D}_X}[K(x, x_0, \sigma) 1_{H_\sigma}] = c^0_{x_0} > 0$. A4 requires that, for small $\sigma$, the mass in $B(x_0, \sigma)$ is spread out and is not all located in a small region of $B(x_0, \sigma)$.
As a simple example, consider the setup where the marginal distribution has uniform density on $[-1, +1]$ and the kernel is the Epanechnikov kernel. Under this setting, for any $x_0 \in (-1, 1)$, we get
$$\lim_{\sigma \to 0} \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)] = \lim_{\sigma \to 0} \frac{3}{8\sigma} \int_{x_0 - \sigma}^{x_0 + \sigma} |x - x_0| \Big(1 - \big(\tfrac{x - x_0}{\sigma}\big)^2\Big)\, dx \le \lim_{\sigma \to 0} \frac{3\sigma}{16} = 0.$$
The same result applies even for $x_0 \in \{-1, +1\}$. Hence assumption A3 is satisfied. To verify the validity of A4, it is enough to see that $\lim_{\sigma \to 0} \inf_{\theta \in [0, \sigma]} \frac{3}{8\sigma} \int_{x_0 - \sigma}^{x_0 + \theta} \big(1 - (\tfrac{x - x_0}{\sigma})^2\big)\, dx = 1/4$. Hence $c^0_{x_0} = 1/4$. Finally, we shall work only with the hinge loss; whenever we refer to the $L$ risk we mean the risk due to the hinge loss. We are now in a position to state our first result regarding the pointwise consistency of LLSVMs.

Theorem 1. Given an $x_0 \in \mathcal{X}$, if assumptions A1-A4 hold, then there exists an $n_0 \in \mathbb{N}$ such that an LLSVM that solves problem (1) at $x_0$ agrees with the Bayes classifier at $x_0$, for all $n \ge n_0$ and appropriate $\sigma, \lambda > 0$ satisfying $\lambda, \sigma \to 0$ and $\frac{n\lambda^2\sigma^{4d}}{\ln^{1+\theta} n} \to \infty$ as $n \to \infty$, for some $\theta > 0$.

3.1 Discussion of Theorem 1

Theorem (1) provides conditions on $n, \lambda, \sigma$ that guarantee that the learnt LLSVM makes a Bayes consistent decision at an arbitrary test point $x_0$. As for KBRs, SVMs, and LPR, our results require $n$ to grow and $\lambda, \sigma$ to decay at certain rates, which are precisely captured by Theorem (1). In classification, one usually proves global consistency results [10, 16], which demonstrate that the 0-1 risk of the classifier converges to that of a Bayes classifier asymptotically. Such global consistency results are asymptotic in nature. In comparison, we prove that, at an arbitrary $x_0$, with a sufficiently large amount of data and appropriate parameter settings $\lambda, \sigma$ (depending on $n$), the LLSVM decision matches that of the Bayes classifier at $x_0$. For these reasons it seems inappropriate to compare consistency results for SVMs with pointwise consistency results for LLSVMs. Proving a global consistency result for LLSVMs remains an open problem, which we intend to tackle in the future.
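Returning to the Epanechnikov example used above to verify A3 and A4 (uniform density $\tfrac{1}{2}$ on $[-1, +1]$), both limits can be checked numerically. The sketch below is our own; it uses a simple trapezoidal sum, and the grid sizes are arbitrary choices:

```python
import numpy as np

def epan(u):
    """Epanechnikov profile, supported on [-1, 1]."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def trap(fx, x):
    """Trapezoidal integral of samples fx over grid x."""
    return float(np.sum((fx[1:] + fx[:-1]) * np.diff(x)) / 2.0)

def a3_term(x0, sigma, m=20001):
    """E[|x - x0| K(x, x0, sigma)] under density 1/2 on [-1, 1]; equals 3*sigma/16."""
    x = np.linspace(x0 - sigma, x0 + sigma, m)
    return trap(0.5 * np.abs(x - x0) * epan((x - x0) / sigma) / sigma, x)

def a4_inf(x0, sigma, m=4001):
    """inf over theta in [0, sigma] of E[K(x, x0, sigma) 1_{[x0-sigma, x0+theta]}];
    the infimum is attained at theta = 0 and equals 1/4, i.e. c0 = 1/4."""
    vals = []
    for theta in np.linspace(0.0, sigma, 51):
        x = np.linspace(x0 - sigma, x0 + theta, m)
        vals.append(trap(0.5 * epan((x - x0) / sigma) / sigma, x))
    return min(vals)

print(a3_term(0.0, 0.1), 3 * 0.1 / 16)   # the A3 term vanishes linearly in sigma
print(a4_inf(0.0, 0.1))                   # ≈ 0.25
```

The factor $0.5$ in both integrands is the uniform marginal density, which is why the constant in front of the integrals is $\frac{3}{8\sigma}$ rather than the bare kernel normalization $\frac{3}{4\sigma}$.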
In the case of LPR, however, pointwise properties have been investigated [9], such as how quickly the squared loss of an LPR estimator at a point $x_0$ converges to the squared loss of the true function. There one is guaranteed that as $n \to \infty$, with appropriate $\sigma$ and with any degree of the polynomial, the excess error at $x_0$ converges to 0. In contrast, as stated above, we prove that we can predict the label of $x_0$ correctly with a finite amount of data. It is also inappropriate to compare the results of LPR and LLSVMs directly, since LPR is concerned with the prediction of a real valued quantity, whereas LLSVMs are concerned with the prediction of a binary label. As we mentioned in the related work section, the proof strategy that was used to prove the consistency of kernel based rules does not work for our case. Techniques from the LPR literature [9] also cannot be used to prove the pointwise consistency of LLSVMs. This is because in LPR one is interested in the squared loss of the estimator; squared loss allows a bias-variance decomposition, and the analysis proceeds by analyzing this decomposition. In classification, however, we are concerned with the 0-1 loss, which does not allow such a decomposition.

3.2 Overview of the Proof of Theorem (1)

As the proof of Theorem (1) is quite involved, we first present an overview. Since the statement of Theorem (1) is for each point $x_0$, we will define certain local quantities and use them throughout the proof. Our proof has three main steps. We first establish the approximation properties of our function class. We then make a connection between the 0-1 risk and the $L$ risk, since the LLSVM problem works with the $L$ risk. Finally, we need a bound on the estimation error of LLSVMs, which roughly says how good the LLSVM objective is as a proxy for the expected local $L$ risk. We shall now explain these three main steps in greater detail.
We borrow some ideas from the proof of consistency of SVMs by Steinwart [16], and shall make appropriate comparisons wherever required. The first step (Lemma (3)) is to establish the local approximation properties of linear functions. In order to do so we define the regularized local $L$ risk, $R^{reg}(w) \stackrel{def}{=} \frac{\lambda}{2}\|w\|^2 + \mathbb{E}[L(y\langle w, x\rangle) K(x, x_0, \sigma)]$, and its corresponding unregularized version, called the local $L$ risk, $R(f) = \mathbb{E}[L(yf(x)) K(x, x_0, \sigma)]$. The minimizer of the regularized local $L$ risk among linear functions is denoted $w^{reg}_* = \arg\min_w R^{reg}(w)$. In Lemma (3) we prove that for small enough $\lambda, \sigma$, the minimum of the local $L$ risk among $C^1$ functions, i.e. $\inf_{f \in C^1} R(f)$, is well approximated by $R^{reg}(w^{reg}_*)$. A similar type of result, although with global quantities, was proved by Steinwart for SVMs. However, there are two main differences. Firstly, Steinwart's proof exploited the universality properties of RKHS spaces. Since we work with linear kernels, which are not universal, Steinwart's arguments do not apply here; we instead use local approximation of $C^1$ functions by linear functions, made possible by a simple use of Taylor's expansion. Secondly, while we work with $C^1$ functions, Steinwart's proof works with the space of all measurable functions. This is because his proof does not make any assumptions on the smoothness of $\eta(\cdot)$; our assumption A1 guarantees that it is enough to work with just $C^1$ functions. The second step (Lemma (4)) connects the $L$ risk with the 0-1 risk. In order to do so we define the local risk of a function $f$ as $R^{0-1}(f) = \mathbb{E}[1[yf(x) \le 0] K(x, x_0, \sigma)]$. The excess local risk of $f$ is simply $R^{0-1}(f) - \inf_f R^{0-1}(f)$, which we prove in Lemma (2) to be equal to $R^{0-1}(f) - R^{0-1}(f_B)$.
In Lemma (4) we prove that, for small enough $\sigma$, the difference between the local $L$ risk of a function $f$ and that of the function in $C^1$ with the smallest local $L$ risk is an upper bound on the excess local 0-1 risk of $f$. This result is a local version of the result first stated in [17, 18]. In the third step, via Lemmas (5)-(7), we bound the deviation of the empirical local risk, $\hat{R}(w) \stackrel{def}{=} \frac{1}{n}\sum_{i=1}^{n} L(y_i\langle w, x_i\rangle) K(x_i, x_0, \sigma)$, from the local risk, $R(w) \stackrel{def}{=} \mathbb{E}[L(y\langle w, x\rangle) K(x, x_0, \sigma)]$, for the solution of problem (1). This is done via uniform stability arguments [19]. A similar result was also used by Steinwart, albeit for global quantities. The fourth and final step puts together all these results to establish conditions for a.s. convergence of the sequence $R^{0-1}(\hat{w}^{reg}_*) \stackrel{def}{=} \mathbb{E}[1[y\langle \hat{w}^{reg}_*, x\rangle \le 0] K(x, x_0, \sigma)]$ to $\inf_{f \in C^1} R(f)$. We then use this stochastic convergence along with assumption A4 to establish Theorem (1). The proof of this final step exploits the fact that $\eta$ is a continuous function.

Lemma 2. Let $f^* = \arg\inf_f R^{0-1}(f)$. Then, $\forall x \in B(x_0, \sigma)$, $f^*(x) \ge 0 \Leftrightarrow \eta(x) \ge \frac{1}{2}$. Hence $R^{0-1}(f^*) = R^{0-1}(f_B)$.

Proof. We have $R^{0-1}(f) = \mathbb{E}_x[(\eta(x)1(f(x) < 0) + (1 - \eta(x))1(f(x) \ge 0)) K(x, x_0, \sigma)]$. Hence $R^{0-1}(f) - R^{0-1}(f^*) = \mathbb{E}_x[(2\eta(x) - 1)(1(f^*(x) \ge 0) - 1(f(x) \ge 0)) K(x, x_0, \sigma)]$. Now, by definition, the above term is non-negative for all measurable functions $f$. Hence in $B(x_0, \sigma)$ the behavior of $f^*$ is exactly the same as that of the Bayes classifier.

The above lemma tells us that even though the local risk uses a kernel function to weight the loss, the minimizer of the local 0-1 risk, in a $\sigma$-neighborhood of $x_0$, behaves like the Bayes optimal classifier.
This simple yet crucial result would not be valid if one used a kernel that could take negative values (a negative kernel): with a negative kernel it is not possible to guarantee that $f^*_{x_0,\sigma} > 0 \Leftrightarrow \eta(x)1[x \in B(x_0, \sigma)] \ge 1/2$.

Lemma 3. Under assumptions A1-A3, at any point $x_0 \in \mathbb{R}^d$, $w^{reg}_*$ satisfies
$$\lim_{\sigma \to 0}\Big[\lim_{\lambda \to 0} R^{reg}(w^{reg}_*) - \inf_{f \in C^1} R(f)\Big] = 0. \qquad (4)$$

Proof. Step 1. We begin by proving the following statement:
$$\forall \sigma > 0: \quad \lim_{\lambda \to 0} R^{reg}(w^{reg}_*) = \inf_w R(w). \qquad (5)$$
Fix a $\sigma > 0$ and let $\epsilon > 0$ be given. Since $R(\cdot)$ is a continuous convex function, it is possible to find at least one $w_{\epsilon,\sigma}$ with $\|w_{\epsilon,\sigma}\| < \infty$ such that $R(w_{\epsilon,\sigma}) \le \inf_w R(w) + \epsilon$. Since $\frac{\lambda}{2}\|w\|^2$ is continuous in $\lambda$, there exists a $\lambda(\epsilon, \sigma)$ such that for all $\lambda \le \lambda(\epsilon, \sigma)$: $\frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 \le \epsilon$. Now for any such $\lambda$ we get
$$R^{reg}(w^{reg}_*) \le R^{reg}(w_{\epsilon,\sigma}) = \frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 + \mathbb{E}[L(y\langle w_{\epsilon,\sigma}, x\rangle) K(x, x_0, \sigma)] \le 2\epsilon + \inf_w R(w). \qquad (6)$$
Since $\epsilon$ was arbitrary, equation (5) follows.

Step 2. In the second step we prove that $\lim_{\sigma \to 0}[\inf_w R(w) - \inf_{f \in C^1} R(f)] = 0$. Suppose the real valued function $g_\sigma$ is the minimizer of $R(f)$ over $f \in C^1$. By Taylor expansion we have $g_\sigma(x) = g_\sigma(x_0) + Dg_\sigma(x_0)(x - x_0) + o(\|x - x_0\|)$; let $w_\sigma$ denote the linear part of this expansion. Since $L$ is 1-Lipschitz,
$$\inf_w R(w) - \inf_{f \in C^1} R(f) \le R(w_\sigma) - R(g_\sigma) = \mathbb{E}[(L(y\langle w_\sigma, x\rangle) - L(yg_\sigma(x))) K(x, x_0, \sigma)] \le \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)] \to 0,$$
where the last step is due to A3. This completes the proof of the second part.

Lemma 4. Suppose $\eta(x_0) \neq 1/2$. Then for a sufficiently small $\sigma$, such that $\eta(x) \neq 1/2$ for any $x \in B(x_0, \sigma)$, we have $R^{0-1}(f) - \inf_f R^{0-1}(f) \le R(f) - \inf_f R(f)$.

Proof. Define $\Delta = \{x \,|\, f(x) f_B(x) < 0\}$ and $f^*_L(x) = \mathrm{sgn}(2\eta(x) - 1)$.
$$R^{0-1}(f) - R^{0-1}(f_B) = \mathbb{E}[|2\eta(x) - 1| K(x, x_0, \sigma) 1_\Delta] \stackrel{(a)}{\le} \mathbb{E}[(1 - \eta(x)L(f^*_L(x)) - (1 - \eta(x))L(-f^*_L(x))) K(x, x_0, \sigma) 1_\Delta] \stackrel{(b)}{\le} R(f) - R(f^*_L).$$
In step (a) we used the fact that for the hinge loss $|2\eta(x) - 1| \le 1 - (\eta(x)L(f^*_L(x)) + (1 - \eta(x))L(-f^*_L(x)))$, and in step (b) we used the fact that on the event $\Delta$ it is better to predict using the 0 function, whose conditional hinge risk is $L(0) = 1$, rather than predicting with $f$. Since $f^*_L$ minimizes the conditional hinge risk pointwise, $R(f^*_L) = \inf_f R(f)$, which completes the proof.

We now need the notion of uniform stability to establish the concentration result outlined in the proof overview. Roughly, uniform stability [19] bounds the difference in the loss of a learning algorithm, at any arbitrary point, due to the removal of any one point from the training dataset.

Lemma 5. The LLSVM obtained by solving the optimization problem (1) at any point $x_0$ has uniform stability $O\big(\frac{2M^2}{n\lambda\sigma^{2d}}\big)$ w.r.t. the loss function $L(y\langle \hat{w}^{reg}_*, x\rangle) K(x, x_0, \sigma)$.

Proof. Let $\hat{w}^{reg}_*, \hat{w}^{-i,reg}_*$ be the LLSVMs learned at $x_0$ using the data sets $S, S^{-i}$ respectively. For any $z = (x, y) \in \mathcal{X} \times \{-1, +1\}$ we have
$$\big(L(y\langle \hat{w}^{reg}_*, x\rangle) - L(y\langle \hat{w}^{-i,reg}_*, x\rangle)\big) K(x, x_0, \sigma) \le M K_m \|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|. \qquad (7)$$
Hence it is enough to bound $\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|$. By definition both $\hat{w}^{reg}_*$ and $\hat{w}^{-i,reg}_*$ are solutions of their respective convex optimization problems. Let
$$N(w) \stackrel{def}{=} \frac{\lambda}{2}\|w - \hat{w}^{-i,reg}_*\|^2 + \frac{1}{n}\Big\langle \sum_{j=1}^{n} dL(y_j\langle \hat{w}^{reg}_*, x_j\rangle) K(x_j, x_0, \sigma) y_j x_j - \sum_{j \neq i} dL(y_j\langle \hat{w}^{-i,reg}_*, x_j\rangle) K(x_j, x_0, \sigma) y_j x_j,\; w - \hat{w}^{-i,reg}_* \Big\rangle, \qquad (8)$$
where $dL(\cdot)$ is an element of the subgradient of $L$ at the appropriate argument. We have $N(\hat{w}^{-i,reg}_*) = 0$ and $dN(\hat{w}^{reg}_*) = 0$. Hence $\hat{w}^{reg}_*$ is an optimal solution of the minimization problem $\min_w N(w)$, and we have $N(\hat{w}^{reg}_*) \le N(\hat{w}^{-i,reg}_*) \le 0$.
We get
$$\frac{\lambda}{2}\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|^2 \stackrel{(a)}{\le} -\frac{1}{n}\big\langle dL(y_i\langle \hat{w}^{reg}_*, x_i\rangle) K(x_i, x_0, \sigma) y_i x_i,\; \hat{w}^{reg}_* - \hat{w}^{-i,reg}_* \big\rangle \le \frac{M K_m}{n}\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|,$$
and hence
$$\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\| \le \frac{2 M K_m}{n\lambda}, \qquad (9)$$
where the inequality in step (a) uses properties of convex functions. Using equations (7), (9) we get $\big(L(y\langle \hat{w}^{reg}_*, x\rangle) - L(y\langle \hat{w}^{-i,reg}_*, x\rangle)\big) K(x, x_0, \sigma) \le O\big(\frac{2M^2}{n\lambda\sigma^{2d}}\big)$.

Lemma 6. [19] Let $A_S$ be the hypothesis learnt by an algorithm $A$ on dataset $S$, such that $0 \le L(A_S, z) \le M_1$. Suppose $A$ has uniform stability $\beta$ w.r.t. $L(\cdot)$. Then, $\forall n \ge 1, \delta \in (0, 1)$, we have
$$P[R - R_{emp} \ge 2\beta + \epsilon] \le \exp\big(-2n\epsilon^2/(4n\beta + M_1)^2\big). \qquad (10)$$

Lemma 7. For any point $x_0 \in \mathbb{R}^d$ we have
$$P\Big[R(\hat{w}^{reg}_*) - \hat{R}(\hat{w}^{reg}_*) \ge \frac{4M^2}{n\lambda\sigma^{2d}} + \epsilon\Big] \le \exp\Big(-\frac{2n\lambda^2\sigma^{4d}\epsilon^2}{(8M^2 + \lambda\sigma^d + M\sqrt{\lambda\sigma^d})^2}\Big). \qquad (11)$$

Proof. The desired result follows from Lemmas (5) and (6), by substituting $\hat{R}(\hat{w}^{reg}_*)$ for $R_{emp}$ and $R(\hat{w}^{reg}_*)$ for $R$ in equation (10), and by substituting $M_1 = O(\frac{1}{\sigma^d}) + O(\frac{M}{\sigma^d\sqrt{\lambda\sigma^d}})$, which is obtained by using the fact that the hinge loss is 1-Lipschitz.

Proof of Theorem (1). The proof is in two parts. In the first part we prove that, under the conditions stated in the premise of the theorem, $R^{0-1}(\hat{w}^{reg}_*) \to R^{0-1}(f_B)$ a.s. The second part then uses this almost sure convergence of the local risk to guarantee that $\hat{w}^{reg}_*$ and $f_B$ agree on the label of $x_0$. Fix any $\epsilon > 0$. Let
$$\delta^{(1)}_{n,\lambda,\sigma} \stackrel{def}{=} \exp\Bigg(-\frac{\epsilon^2 n\sigma^{2d}}{2\big(1 + M\sqrt{\tfrac{2}{\lambda}\mathbb{E}K(x, x_0, \sigma)}\big)^2}\Bigg), \qquad \delta^{(2)}_{n,\lambda,\sigma} \stackrel{def}{=} \exp\Bigg(-\frac{2n\lambda^2\sigma^{4d}\epsilon^2}{(8M^2 + \lambda\sigma^d + M\sqrt{\lambda\sigma^d})^2}\Bigg),$$
and define $\delta_{n,\lambda,\sigma} \stackrel{def}{=} \delta^{(1)}_{n,\lambda,\sigma} + \delta^{(2)}_{n,\lambda,\sigma}$.
For appropriately chosen values of $\sigma(\epsilon), \lambda(\sigma(\epsilon))$ we have, with probability at least $1 - \delta_{n,\lambda,\sigma}$,
$$R^{reg}(\hat{w}^{reg}_*) = \frac{\lambda}{2}\|\hat{w}^{reg}_*\|^2 + R(\hat{w}^{reg}_*) \stackrel{(a)}{\le} \frac{\lambda}{2}\|\hat{w}^{reg}_*\|^2 + \hat{R}(\hat{w}^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + \epsilon \stackrel{(b)}{\le} \frac{\lambda}{2}\|w^{reg}_*\|^2 + \hat{R}(w^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + \epsilon \stackrel{(c)}{\le} \frac{\lambda}{2}\|w^{reg}_*\|^2 + R(w^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + 2\epsilon = R^{reg}(w^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + 2\epsilon \stackrel{(d)}{\le} \inf_{f \in C^1} R(f) + \frac{4M^2}{n\lambda\sigma^{2d}} + 4\epsilon + \frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 + \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)]. \qquad (12)$$
In the above equations, step (a) follows from Lemma (7), and hence incurs a failure probability of at most $\delta^{(2)}_{n,\lambda,\sigma}$; step (b) follows from the fact that $\hat{w}^{reg}_*$ is the minimizer of the empirical regularized objective; and step (c) uses Hoeffding's inequality and incurs a failure probability of $\delta^{(1)}_{n,\lambda,\sigma}$. Choosing small enough $\sigma(\epsilon), \lambda(\sigma(\epsilon), \epsilon)$, inequality (d) follows from Lemma (3). Applying Lemma (4), we get, with probability at least $1 - \delta_{n,\lambda,\sigma}$,
$$R^{0-1}(\hat{w}^{reg}_*) - \inf_f R^{0-1}(f) \le R(\hat{w}^{reg}_*) - \inf_{f \in C^1} R(f) \le R^{reg}(\hat{w}^{reg}_*) - \inf_{f \in C^1} R(f) \stackrel{(a)}{\le} \frac{4M^2}{n\lambda\sigma^{2d}} + 4\epsilon + \frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 + \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)].$$
Step (a) follows from equation (12) and the fact that the marginal distribution on $\mathcal{X}$ is absolutely continuous; the absolute continuity guarantees that $\frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 \to 0$. If $n \to \infty$, $\lambda \to 0$, $\sigma \to 0$, and $n\lambda^2\sigma^{4d} \to \infty$, we conclude that $R^{0-1}(\hat{w}^{reg}_*) \to \inf_f R^{0-1}(f) = R^{0-1}(f_B)$ in probability. For data-dependent choices of $\lambda, \sigma$ that satisfy $\lambda, \sigma \to 0$ and $\frac{n\lambda^2\sigma^{4d}}{\log^{1+\theta} n} \to \infty$, we get $\sum_{n=1}^{\infty}\delta_{n,\lambda,\sigma} < \infty$; hence, by the Borel-Cantelli lemma, the convergence $R^{0-1}(\hat{w}^{reg}_*) \to \inf_f R^{0-1}(f) = R^{0-1}(f_B)$ also happens almost surely.

Figure 1: All the points in this ball, of radius $\sigma$, centered around $x_0$, are labeled +1 by the Bayes classifier.
The region of intersection between the hyperplane and the ball, which contains the center $x_0$, is misclassified by the hyperplane; the volume of this region is at least half of the volume of the ball.

We shall now prove the second part. If $\eta(x_0) = 1/2$, then the prediction of the LLSVM at $x_0$ is irrelevant. Hence let $\eta(x_0) > 1/2$; the proof is the same if $\eta(x_0) < 1/2$. Choose $\sigma_1$ such that $\inf_{x \in B(x_0, \sigma_1)} 2\eta(x) - 1 \ge \frac{2\eta(x_0) - 1}{2}$. Notice that, because of continuity, $2\eta(x) - 1$ has the same sign everywhere in $B(x_0, \sigma_1)$ (see Figure (1)). From A4 we are guaranteed that there exists $\sigma_2 > 0$ such that for all $0 < \sigma \le \sigma_2$ we have $\inf_{H_\sigma} \mathbb{E}[K(x, x_0, \sigma) 1_{H_\sigma}] \ge \frac{c^0_{x_0}}{2}$. Let $0 < \sigma_0 \le \min\{\sigma_1, \sigma_2\}$. Now, from the first part of the proof we know that $R^{0-1}(\hat{w}^{reg}_*) \to R^{0-1}(f_B)$ almost surely. This guarantees that there exists a sufficiently large $n_0$ such that, for appropriate $\sigma \le \sigma_0$ and an appropriate choice of $\lambda$, we get
$$P\big[R^{0-1}(\hat{w}^{reg}_*) - R^{0-1}(f_B) \le c^0_{x_0}|2\eta(x_0) - 1|/8\big] = 1. \qquad (13)$$
Now, for the above choice of $n_0, \lambda, \sigma$, denote by $\Delta$ the region of disagreement between $\hat{w}^{reg}_*$ and $f_B$, and assume that $x_0 \in \Delta$. Since $2\eta(x) - 1$ has the same sign everywhere in $B(x_0, \sigma)$, we get $\Delta = \{x \in B(x_0, \sigma) \,|\, \langle \hat{w}^{reg}_*, x\rangle \le 0\}$, and hence the volume of $\Delta$ is at least half of that of $B(x_0, \sigma)$. Hence
$$R^{0-1}(\hat{w}^{reg}_*) - R^{0-1}(f_B) = \mathbb{E}[|2\eta(x) - 1| K(x, x_0, \sigma) 1_\Delta] \ge \frac{2\eta(x_0) - 1}{2}\mathbb{E}[K(x, x_0, \sigma) 1_\Delta] \ge \frac{2\eta(x_0) - 1}{2}\inf_{H_\sigma}\mathbb{E}[K(x, x_0, \sigma) 1_{H_\sigma}] \ge \frac{(2\eta(x_0) - 1)c^0_{x_0}}{4}, \qquad (14)$$
which contradicts equation (13). Hence $f_B$ and $\hat{w}^{reg}_*$ agree on the label of $x_0$.

4 Risk Bounds and Rates of Convergence to the Stochastic Objective

LLSVMs solve a local optimization problem that can be seen as minimizing an empirical version of the stochastic objective $R^{reg}(w)$.
It is then natural to ask how quickly the value of the stochastic objective at $w = \hat{w}^*_{reg}$ converges to the minimum of the stochastic objective. In Theorem 8 we demonstrate, via stability arguments, that for an arbitrary test point $x_0$ this convergence happens at the rate $\tilde{O}(1/(\sqrt{n}\,\lambda\sigma^{2d}))$. In Theorem 9 we establish generalization bounds, in terms of the empirical error of LLSVMs, for a global classifier learned by solving an LLSVM at any randomly drawn point $x$. Due to lack of space the proofs are postponed to the supplement.

Theorem 8. With probability at least $1 - \delta$ over the random input training set, we have

$$
R_{reg}(\hat{w}^*_{reg}) - R_{reg}(w^*_{reg}) \le \tilde{O}\Big(\frac{1}{\sqrt{n}\,\lambda\sigma^{2d}}\Big). \tag{15}
$$

Discussion of Theorem 8. It may be possible to improve the dependence on $n$ from $1/\sqrt{n}$ to $1/n$ via the peeling idea [20]. Based on [20], we conjecture that the dependence on $\lambda$ is optimal, while the dependence on $\sigma$ may be improvable from $1/\sigma^{2d}$ to $1/\sigma^d$.

Theorem 9. Let $\hat{w}^*_{reg}(x)$ be the vector obtained by solving the LLSVM problem, with parameters $\lambda, \sigma$, at a randomly drawn point $x$. With probability at least $1 - \delta$ over the random sample, we have

$$
\mathbb{E}\, L(y \langle \hat{w}^*_{reg}(x), x \rangle) \le \frac{1}{n} \sum_{i=1}^n L(y_i \langle \hat{w}^*_{reg}(x_i), x_i \rangle) + \frac{4M^2}{n\lambda\sigma^d} + \Big(1 + O\big(M\sqrt{1/(\lambda\sigma^d)}\big)\Big) \sqrt{\frac{\ln(1/\delta)}{2n}}.
$$

Discussion of Theorem 9. Without any further noise assumptions, the dependence on $n, \lambda, \sigma$ is optimal. Under Tsybakov's noise assumption [2], it is possible to improve the dependence on $n$. The exponential dependence on $d$ is expected, and is typical of nonparametric methods.

5 Proofs of Theorems 8 and 9

For convenience we begin with a risk bound from [19]. This risk bound relies on the notion of uniform stability.
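The leave-one-out intuition behind uniform stability can be probed numerically on the LLSVM objective itself. The sketch below is illustrative, not the paper's experiment: it assumes a Gaussian kernel for $K$, replaces the hinge with a smoothed 1-Lipschitz variant so that plain gradient descent solves the $\lambda$-strongly convex problem essentially exactly, and compares the change in the learned weight vector after deleting one training point against the closed-form bound $2MK(x_i, x_0, \sigma)/(n\lambda)$ that the stability analysis below derives for convex 1-Lipschitz losses.

```python
import numpy as np

def smoothed_hinge_grad(t, gamma=0.5):
    # Derivative of a 1-Lipschitz smoothed hinge:
    # L(t) = 1 - t - gamma/2 for t <= 1-gamma, (1-t)^2/(2*gamma) for 1-gamma < t < 1, 0 for t >= 1.
    return np.where(t <= 1.0 - gamma, -1.0, np.where(t < 1.0, -(1.0 - t) / gamma, 0.0))

def llsvm_fit(X, y, K, lam, n_norm, steps=20000, lr=0.05):
    # Gradient descent on (lam/2)||w||^2 + (1/n_norm) sum_j K_j L(y_j <w, x_j>).
    # n_norm stays fixed at the full sample size even when a point is removed,
    # matching the leave-one-out objective used in the stability analysis.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = smoothed_hinge_grad(y * (X @ w))
        w -= lr * (lam * w + (K * g * y) @ X / n_norm)
    return w

rng = np.random.default_rng(2)
n, lam, sigma = 60, 1.0, 1.0
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))
x0 = np.zeros(2)
K = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * sigma ** 2))  # assumed Gaussian kernel
M = np.max(np.linalg.norm(X, axis=1))                            # bound on ||x_j||

w_full = llsvm_fit(X, y, K, lam, n_norm=n)
i = 0
mask = np.arange(n) != i                                         # delete training point i
w_loo = llsvm_fit(X[mask], y[mask], K[mask], lam, n_norm=n)

diff = np.linalg.norm(w_full - w_loo)
bound = 2.0 * M * K[i] / (n * lam)
print(diff <= bound)                                             # True
```

In practice the observed change is far below the bound, since the bound only uses the Lipschitz constant and the kernel weight of the deleted point.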
For any learning algorithm $A$ that learns a function $A_S$ after training on the dataset $S$, uniform stability quantifies the maximum absolute change in the loss suffered by the algorithm, at any point in the space, when an arbitrary point $x_i$ is removed from the training dataset. The precise definition is as follows.

Definition 1 ([19]). An algorithm $A$ has uniform stability $\beta$ w.r.t. the loss function $L$ if

$$
\forall S,\ \forall i \in \{1, \ldots, n\}: \quad \|L(A_S, \cdot) - L(A_{S^{-i}}, \cdot)\|_\infty \le \beta. \tag{16}
$$

Lemma 10. Let $A$ be an algorithm with uniform stability $\beta$ w.r.t. a loss function $0 \le L(A_S, (x, y)) \le M_1$ for all $z := (x, y)$ and all sets $S$. Then for any $n \ge 1$ and any $\delta \in (0, 1)$, the following bound holds with probability at least $1 - \delta$ over the random draw of the sample $S$:

$$
\mathbb{E}_{z \sim D}\, L(A_S, z) \le \frac{1}{n} \sum_{i=1}^n L(A_S, z_i) + 2\beta + (4n\beta + M_1) \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{17}
$$

Theorem 8. With probability at least $1 - \delta$ over the random input training set, we have

$$
R_{reg}(\hat{w}^*_{reg}) - R_{reg}(w^*_{reg}) \le 2\beta + (4n\beta + M_1) \sqrt{\frac{\log(1/\delta)}{2n}}, \tag{18}
$$

where

$$
\beta \le \frac{2M K_m}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{j=1}^n K(x_j, x_0, \sigma)} + M K_m \Bigg], \tag{19}
$$

$$
M_1 \le \frac{L(0)}{n} \sum_{j=1}^n K(x_j, x_0, \sigma) + L(0) K_m + K_m M \sqrt{\frac{2L(0)}{n\lambda} \sum_{j=1}^n K(x_j, x_0, \sigma)}. \tag{20}
$$

Proof. The proof is via stability arguments. Let $z := (x, y)$. Consider the loss function

$$
q(w, z) = \frac{\lambda}{2}\|w\|^2 + L(y \langle w, x \rangle) K(x, x_0, \sigma) - \Big[ \frac{\lambda}{2}\|w^*_{reg}\|^2 + L(y \langle w^*_{reg}, x \rangle) K(x, x_0, \sigma) \Big]. \tag{21}
$$

It is enough to bound the stability of LLSVMs w.r.t. this loss function, and to upper bound the loss itself. To bound the stability of LLSVMs it suffices to upper bound, for all $S$ and $(x, y)$, the quantity $|q(\hat{w}^*_{reg}, (x, y)) - q(\hat{w}^{-i,*}_{reg}, (x, y))|$, where $S^{-i}$ is the dataset obtained from $S$ by deleting the point $(x_i, y_i)$, and $\hat{w}^{-i,*}_{reg}$ is the LLSVM solution learned at $x_0$ from $S^{-i}$.
From Equation (21) it is clear that $q(w, z)$ is $\lambda$-strongly convex in $w$ in the $L_2$ norm. Hence, by strong convexity,

$$
q(\hat{w}^*_{reg}, z) \ge q(\hat{w}^{-i,*}_{reg}, z) + (\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg})^T \partial q(\hat{w}^{-i,*}_{reg}, z) + \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2. \tag{22}
$$

Similarly, we have

$$
q(\hat{w}^{-i,*}_{reg}, z) \ge q(\hat{w}^*_{reg}, z) + (\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^*_{reg}, z) + \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2. \tag{23}
$$

From equations (22) and (23) we get

$$
(\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^*_{reg}, z) + \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2
\le q(\hat{w}^{-i,*}_{reg}, z) - q(\hat{w}^*_{reg}, z)
\le (\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^{-i,*}_{reg}, z) - \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2. \tag{24}
$$

We shall now upper bound the rightmost term and lower bound the leftmost term; doing so will enable us to bound the stability. Differentiating Equation (21) w.r.t. $w$ we get

$$
\partial q(\hat{w}^{-i,*}_{reg}, z) = \lambda \hat{w}^{-i,*}_{reg} + \partial L(y \langle \hat{w}^{-i,*}_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x, \tag{25}
$$

$$
\partial q(\hat{w}^*_{reg}, z) = \lambda \hat{w}^*_{reg} + \partial L(y \langle \hat{w}^*_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x. \tag{26}
$$

Now, to bound the rightmost term of equation (24), we use equation (25) to get

$$
(\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^{-i,*}_{reg}, z) - \frac{\lambda}{2} \|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\|^2
\le (\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \big[ \lambda \hat{w}^{-i,*}_{reg} + \partial L(y \langle \hat{w}^{-i,*}_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x \big]
\le \|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\| \cdot \big\|\lambda \hat{w}^{-i,*}_{reg} + \partial L(y \langle \hat{w}^{-i,*}_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x\big\|, \tag{27}
$$

where the last inequality follows from the Cauchy–Schwarz inequality. We begin by bounding $\|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\|$.
Now, by the first-order optimality conditions defining $\hat{w}^*_{reg}$ and $\hat{w}^{-i,*}_{reg}$, we get

$$
\lambda \hat{w}^*_{reg} + \frac{1}{n} \sum_{j=1}^n \partial L(y_j \langle \hat{w}^*_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j = 0, \tag{28}
$$

$$
\lambda \hat{w}^{-i,*}_{reg} + \frac{1}{n} \sum_{\substack{j=1 \\ j \ne i}}^n \partial L(y_j \langle \hat{w}^{-i,*}_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j = 0. \tag{29}
$$

Now consider the following convex optimization problem:

$$
N(w) = \frac{\lambda}{2}\|w - \hat{w}^{-i,*}_{reg}\|^2 + \frac{1}{n} \Big\langle \sum_{j=1}^n \partial L(y_j \langle \hat{w}^*_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j - \sum_{\substack{j=1 \\ j \ne i}}^n \partial L(y_j \langle \hat{w}^{-i,*}_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j,\; w - \hat{w}^{-i,*}_{reg} \Big\rangle. \tag{30}
$$

Using equations (28) and (29) it is straightforward to verify that $\frac{\partial N(\hat{w}^*_{reg})}{\partial w} = 0$, and hence, from convex analysis, $\hat{w}^*_{reg}$ is the optimal solution of the convex optimization problem $N(w)$. Also $N(\hat{w}^*_{reg}) \le N(\hat{w}^{-i,*}_{reg}) = 0$. Hence we get

$$
\frac{\lambda}{2}\|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2
\le -\frac{1}{n} \sum_{\substack{j=1 \\ j \ne i}}^n \big\langle \partial L(y_j \langle \hat{w}^*_{reg}, x_j \rangle)\, y_j K(x_j, x_0, \sigma)\, x_j - \partial L(y_j \langle \hat{w}^{-i,*}_{reg}, x_j \rangle)\, y_j K(x_j, x_0, \sigma)\, x_j,\; \hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg} \big\rangle - \frac{1}{n} \big\langle \partial L(y_i \langle \hat{w}^*_{reg}, x_i \rangle)\, y_i K(x_i, x_0, \sigma)\, x_i,\; \hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg} \big\rangle
$$
$$
\le -\frac{1}{n} \big\langle \partial L(y_i \langle \hat{w}^*_{reg}, x_i \rangle)\, y_i K(x_i, x_0, \sigma)\, x_i,\; \hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg} \big\rangle
\le \frac{1}{n}\, M\, K(x_i, x_0, \sigma)\, \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|, \tag{31}
$$

where the second inequality uses the convexity of $L(\cdot)$, which gives $(\partial L(b) - \partial L(a))(b - a) \ge 0$, and the last inequality follows from Cauchy–Schwarz and the fact that $L(\cdot)$ is $1$-Lipschitz. Hence we get

$$
\|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\| \le \frac{2}{n\lambda}\, M\, K(x_i, x_0, \sigma).
$$
(32)

Finally, by the optimality of $\hat{w}^{-i,*}_{reg}$ and $\hat{w}^*_{reg}$, we have

$$
\frac{\lambda}{2}\|\hat{w}^{-i,*}_{reg}\|^2 \le \frac{1}{n} \sum_{\substack{j=1 \\ j \ne i}}^n L(0)\, K(x_j, x_0, \sigma), \tag{33}
$$

$$
\frac{\lambda}{2}\|\hat{w}^*_{reg}\|^2 \le \frac{1}{n} \sum_{j=1}^n L(0)\, K(x_j, x_0, \sigma). \tag{34}
$$

Using equations (24), (25), (27), (32), and (33) we get

$$
q(\hat{w}^{-i,*}_{reg}, z) - q(\hat{w}^*_{reg}, z) \le \frac{2M K(x_i, x_0, \sigma)}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{\substack{j=1 \\ j \ne i}}^n K(x_j, x_0, \sigma)} + M K(x, x_0, \sigma) \Bigg]. \tag{35}
$$

One can use similar techniques to lower bound the leftmost term in Equation (24) to get

$$
q(\hat{w}^{-i,*}_{reg}, z) - q(\hat{w}^*_{reg}, z) \ge -\frac{2M K(x_i, x_0, \sigma)}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{j=1}^n K(x_j, x_0, \sigma)} + M K(x, x_0, \sigma) \Bigg]. \tag{36}
$$

Using the fact that $\beta = \sup_{S,z} |q(\hat{w}^*_{reg}, z) - q(\hat{w}^{-i,*}_{reg}, z)|$ and equations (35) and (36), we get

$$
\beta \le \frac{2M K_m}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{j=1}^n K(x_j, x_0, \sigma)} + M K_m \Bigg]. \tag{37}
$$

In order to apply Lemma 10 it is enough to upper bound $q(\hat{w}^*_{reg}, z)$. We have

$$
q(\hat{w}^*_{reg}, z) \le \frac{\lambda}{2}\|\hat{w}^*_{reg}\|^2 + L(y \langle \hat{w}^*_{reg}, x \rangle) K(x, x_0, \sigma)
\le \frac{L(0)}{n} \sum_{j=1}^n K(x_j, x_0, \sigma) + L(y \langle \hat{w}^*_{reg}, x \rangle) K(x, x_0, \sigma)
\le \frac{L(0)}{n} \sum_{j=1}^n K(x_j, x_0, \sigma) + L(0) K_m + K_m M \sqrt{\frac{2L(0)}{n\lambda} \sum_{j=1}^n K(x_j, x_0, \sigma)}. \tag{38}
$$

Applying Lemma 10 to LLSVMs with the loss function $q(A_S, z)$, and using the fact that $\hat{R}_{reg}(\hat{w}^*_{reg}) \le \hat{R}_{reg}(w^*_{reg})$, we get the desired result.

Theorem 9. Let $\hat{w}^*_{reg}(x)$ be the solution obtained by solving the LLSVM problem at $x$. With probability at least $1 - \delta$ over the random sample for an LLSVM, we have

$$
\mathbb{E}\, L(y \langle \hat{w}^*_{reg}(x), x \rangle) \le \frac{1}{n} \sum_{i=1}^n L(y_i \langle \hat{w}^*_{reg}(x_i), x_i \rangle) + \frac{4M^2}{n\lambda\sigma^d} + \Big(1 + O\big(M \sqrt{1/(\lambda\sigma^d)}\big)\Big) \sqrt{\frac{\ln(1/\delta)}{2n}}.
$$

Proof. By Lemma 10 we are done if we can upper bound the loss suffered by LLSVMs at any point, and the stability of LLSVMs w.r.t. the loss $L(y \langle \hat{w}^*_{reg}(x), x \rangle)$.
We have $|L(y \langle \hat{w}^*_{reg}(x), x \rangle) - L(y \langle \hat{w}^{-i,*}_{reg}(x), x \rangle)| = O(2M^2/(n\lambda\sigma^d))$, where we used the upper bound on $\|\hat{w}^*_{reg}(x) - \hat{w}^{-i,*}_{reg}(x)\|$ given in Equation 9 of Lemma 5 in the main paper. Finally, $L(y \langle \hat{w}^*_{reg}(x), x \rangle) \le 1 + M \|\hat{w}^*_{reg}(x)\| \le 1 + O(M \sqrt{1/(\lambda\sigma^d)})$. Applying Lemma 10 with $\beta = O(2M^2/(n\lambda\sigma^d))$ and $M_1 = 1 + O(M \sqrt{1/(\lambda\sigma^d)})$ finishes the proof.

6 Discussion and Open Problems

Our results guarantee that the decision of an LLSVM learned at $x_0$ matches that of the Bayes classifier after it has seen enough data. An important open problem is to establish global Bayes consistency of LLSVMs; it is not clear to us whether the pointwise consistency result can be used to do so. Theorem 9 currently does not exploit our large-margin formulation; a natural extension would be a result that depends on some local notion of margin. Our current results depend on the dimensionality of the ambient space. It should be possible, under appropriate manifold assumptions [21, 22], to improve this dependency to the intrinsic dimension.

References

[1] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, July 2003.
[2] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: P&S, 9:323–375, 2005.
[3] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[4] H. Zhang, A. C. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2126–2136. IEEE, 2006.
[5] E. Blanzieri and F. Melgani. An adaptive SVM nearest neighbor classifier for remotely sensed imagery. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 3931–3934. IEEE, 2006.
[6] N. Segata and E. Blanzieri. Fast and scalable local kernel machines. Journal of Machine Learning Research, 2010.
[7] H. Cheng, P. N. Tan, and R. Jin. Efficient algorithm for localized support vector machine. IEEE Transactions on Knowledge and Data Engineering, pages 537–549, 2009.
[8] A. Gray and A. W. Moore. N-body problems in statistical learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.
[9] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, 2009.
[10] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[11] T. K. Kim and J. Kittler. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):318–327, 2005.
[12] S. Yan, H. Zhang, Y. Hu, B. Zhang, and Q. Cheng. Discriminant analysis on embedded manifold. In Computer Vision – ECCV 2004, pages 121–132, 2004.
[13] J. Dai, S. Yan, X. Tang, and J. T. Kwok. Locally adaptive classification piloted by uncertainty. In ICML, pages 225–232. ACM, 2006.
[14] O. Dekel and O. Shamir. There's a hole in my data space: Piecewise predictors for heterogeneous learning problems. In AISTATS, 2012.
[15] Y. Yang. Minimax nonparametric classification. I. Rates of convergence. IEEE Transactions on Information Theory, 45(7):2271–2284, 1999.
[16] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.
[17] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 2004.
[18] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473):138–156, 2006.
[19] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[20] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, pages 1545–1552, 2008.
[21] A. Ozakin and A. Gray. Submanifold density estimation. In Advances in Neural Information Processing Systems 22, 2009.
[22] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems, 2009.
