Local Support Vector Machines: Formulation and Analysis


Authors: Ravi Ganti, Alexander Gray

Local Support Vector Machines: Formulation and Analysis

Ravi Ganti, Alexander Gray
School of Computational Science & Engineering, Georgia Tech
gmravi2003@gatech.edu, agray@cc.gatech.edu

Abstract

We provide a formulation for Local Support Vector Machines (LSVMs) that generalizes previous formulations and brings out the explicit connections to the local polynomial learning used in the nonparametric estimation literature. We investigate the simplest type of LSVMs, called Local Linear Support Vector Machines (LLSVMs). For the first time we establish conditions under which LLSVMs make Bayes consistent predictions at each test point $x_0$. We also establish rates at which the local risk of LLSVMs converges to the minimum value of the expected local risk at each point $x_0$. Using stability arguments we establish generalization error bounds for LLSVMs.

1 Introduction

We consider the problem of binary classification, where we are given a sample $S$ of $n$ i.i.d. points, $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \in (\mathcal{X} \times \{-1, +1\})^n$, $\mathcal{X} \subset \mathbb{R}^d$, and we are required to learn a classifier $g_n : \mathcal{X} \to \{-1, +1\}$ using $S$. Binary classification is a well studied problem in machine learning [1, 2]. One of the simplest classification algorithms is the $k$-nearest neighbour (kNN) algorithm, which takes a majority vote over the $k$ nearest neighbours of a test point $x_0$ in order to determine the label of $x_0$. The kNN algorithm, among others, belongs to the class of local learning algorithms, which take into account only the local information around the test point $x_0$ when deciding the label of $x_0$. Cortes and Vapnik [3] introduced the celebrated support vector machine (SVM), which learns a decision boundary by maximizing the margin. Zhang et al. [4] and Blanzieri et al. [5] independently proposed a classification algorithm called Local Support Vector Machines (LSVMs) and applied it to remote sensing and visual recognition tasks respectively.
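The kNN rule just described can be sketched in a few lines. The sketch below is our own illustration (the toy data and function names are not from the paper):

```python
import numpy as np

def knn_predict(X, y, x0, k):
    """Majority vote over the k nearest neighbours of x0; labels are in {-1, +1}."""
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to the test point
    nearest = np.argsort(dists)[:k]          # indices of the k nearest training points
    return 1 if y[nearest].sum() >= 0 else -1

# toy data: a negative cluster near the origin, a positive cluster near (3, 3)
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
y = np.array([-1, -1, 1, 1, 1])
print(knn_predict(X, y, np.array([2.8, 3.0]), k=3))   # → 1
```

The prediction depends only on the $k$ points nearest to $x_0$; this is the locality that LSVMs combine with large margin classification below.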
LSVMs exploit locality, like kNN, along with the idea of large margin classification, to learn a global non-linear classifier, by learning an SVM locally at each test point. By using a large margin approach, LSVMs inherit the large margin classifier's robustness to data perturbation. By utilizing only local information, LSVMs avoid relying on the global geometry of the distribution. This is a good strategy when our data lies on a manifold, where the geodesic distance between close points is approximately Euclidean, but the same is not true for points far away. A prime example is the task of image recognition. Segata et al. [6] compared their implementation of approximate LSVM with a standard RBF SVM in LIBSVM on the 2-spirals dataset, a 2-dimensional dataset where the data lives on a manifold. The accuracy of LSVM was 88.47%, whereas that of SVM was only 85.29%, and that of kNN was 88.43%. Another scenario where local learning is more beneficial than global learning is when data is multimodal and/or heterogeneous. For example, suppose that given census data we are required to classify people as belonging to the high income group (HIG) or the middle income group (MIG). A global HIG vs. MIG classifier might be hard to build, since the notion of HIG/MIG changes with states/counties. It is better to utilize local information, such as information from a particular county, to build multiple local classifiers; our global classifier is then a collection of many such local classifiers. The problem of webspam detection is another example where data is heterogeneous. A page might be webspam for one category but may not be for another. In such cases, in order to categorize a webpage as spam or not, it is easier to build multiple local classifiers rather than a single global classifier. A numerical illustration was provided by Cheng et al.
[7]: on a one-vs-all classification problem on the Covtype dataset, their implementation of LSVM registered an accuracy of about 90%, whereas the accuracy of RBF SVM was only 86.21% and that of kNN was 67.40%. Since a one-vs-all classification problem makes the dataset multimodal and heterogeneous (due to the grouping of multiple classes into one single class), the superior performance of LSVMs over SVMs illustrates the power of local learning in such settings. A practical advantage of LSVMs is that they can exploit fast algorithms for range search [8], and various other approximation techniques [6], along with parallel computing architectures, in order to learn on large datasets. SVMs are well understood both theoretically and practically [2, 3]. However, no theoretical understanding yet exists for LSVMs. The empirical success of LSVMs begs a theoretical understanding of such techniques. Our work is the first attempt to provide a theoretical understanding of LSVMs. Our contributions are as follows:
• We provide a formulation of LSVMs which generalizes previous formulations due to Blanzieri et al. [5] and Zhang et al. [4], and provide the first statistical analysis of locally linear support vector machines (LLSVMs), the simplest kind of LSVMs. Our formulation (Section (2)) is similar to that of Cheng et al. [7] but is more directly motivated by local polynomial regression; hence the role of the smoothing kernel in our formulation is much cleaner than their use of an unspecified weight function. Our formulation makes explicit the direct connections to local polynomial fitting via the use of polynomial Mercer kernels. This allows us to view LSVMs as approximating the decision boundary locally using smooth functions, a novel interpretation.
• In Theorem (1) (Section (3)) we provide sufficient conditions which guarantee that, for any given point $x_0$, the prediction of LLSVMs at $x_0$ matches that of the Bayes classifier, establishing pointwise consistency.
These are conditions on the distribution and the model parameters $\lambda, \sigma$.
• The LLSVM problem at any point $x_0$ minimizes the sample version of the stochastic objective $\min_w \frac{\lambda}{2}\|w\|^2 + \mathbb{E}[L(y\langle w, x\rangle) K(x, x_0, \sigma)]$. In Theorem (8) (Section (4)) we provide a high probability bound of $\tilde{O}\big(\frac{1}{\sqrt{n}\,\lambda\sigma^{2d}}\big)$ on the difference between the smallest value of this stochastic objective and its value at the solution obtained by solving the LLSVM optimization problem. This result tells us how quickly the stochastic objective, at the solution of the LLSVM problem, converges to its true minimum value.
• Define the $L$ risk of any function $f$ as $\mathbb{E}\,L(yf(x))$. In Theorem (9) (Section (4)) we establish, via uniform stability bounds, an upper bound on the gap between the $L$ risk of the global function learnt by LLSVMs, with bandwidth $\sigma$ and regularization $\lambda$, and the empirical $L$ risk of LLSVMs. This gap decays as $O\big(\frac{1}{\sqrt{n\lambda\sigma^{d}}}\big)$.
Notice that while Theorems (1), (8) are pointwise results, Theorem (9) involves the global classifier learnt by solving the LLSVM problem at each training point $x_i$ with parameters $\lambda, \sigma$; hence it is a "global" result. Theorems (1), (9) suggest that LLSVMs should work well in low dimensions, or if the data lies on a low-dimensional manifold. This justifies the empirical findings of Segata et al. and Cheng et al.

2 Formulation of LLSVMs and LSVMs

Our formulation of LLSVMs is directly motivated by a technique in nonparametric regression called local linear regression (LLR) [9]. In LLR one fits a non-linear regression function by fitting a linear function locally at each point. The idea of a local linear fit is inspired by the fact that any differentiable function can be well approximated locally by linear functions. Hence LLR locally approximates the underlying regression function with a linear function. LLSVMs adopt a similar approach by making local linear fits to the underlying non-linear decision boundary.
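The LLR procedure described above is just kernel-weighted least squares, solved independently at each evaluation point. A minimal numerical sketch of this (ours, not the paper's; the Epanechnikov weights and the tiny ridge term added for numerical safety are illustrative choices):

```python
import numpy as np

def llr_predict(X, y, x0, sigma):
    """Local linear regression at x0: fit a linear function by weighted least
    squares with Epanechnikov kernel weights K(x_i, x0, sigma), evaluate it at x0."""
    Xb = np.hstack([X, np.ones((len(X), 1))])            # append 1 for the intercept
    u = np.linalg.norm(X - x0, axis=1) / sigma
    k = np.where(u <= 1.0, 0.75 * (1.0 - u ** 2) / sigma, 0.0)  # local weights
    A = Xb.T @ (k[:, None] * Xb) + 1e-9 * np.eye(Xb.shape[1])   # tiny ridge for stability
    w = np.linalg.solve(A, Xb.T @ (k * y))               # weighted normal equations
    return w[:-1] @ x0 + w[-1]

# the local linear fit tracks a smooth nonlinear function closely
X = np.linspace(0.0, 3.0, 300).reshape(-1, 1)
y = np.sin(X[:, 0])
print(llr_predict(X, y, np.array([1.5]), sigma=0.3))   # close to sin(1.5) ≈ 0.997
```

LLSVMs below replace the squared loss in this weighted fit with the hinge loss, so that the local linear function approximates the decision boundary rather than the regression function.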
In order to classify an unseen point $x_0$, LLSVMs solve the problem
$$\hat{w}^{reg}_* = \arg\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} L(y_i \langle w, x_i \rangle) K(x_i, x_0, \sigma), \qquad (1)$$
where $K(x_i, x_0, \sigma)$ is a smoothing kernel with bandwidth $\sigma$ and $L(\cdot)$ is a Lipschitz, convex upper bound to the 0-1 loss. In this paper we will be concerned with the hinge loss $L(t) \stackrel{def}{=} \max\{1 - t, 0\}$, which is used in SVMs. Some popular examples of smoothing kernels are the Epanechnikov kernel and the rectangular kernel [9]. (Note that smoothing kernels are not the same as the Mercer kernels used in SVMs; the Epanechnikov kernel is a popular example of a smoothing kernel that is not a Mercer kernel.) Replacing the term $L(y_i \langle w, x_i \rangle)$ in equation (1) with the term $L(y_i \langle w, \phi(x_i) \rangle)$, where $\phi(\cdot)$ is a kernel map induced by a Mercer kernel, we get LSVMs:
$$w_* = \arg\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} L(y_i \langle w, \phi(x_i) \rangle) K(x_i, x_0, \sigma). \qquad (2)$$
Our formulation of LSVMs, as shown in equation (2), strictly generalizes the formulations of both Blanzieri et al. and Zhang et al. Strictly speaking, their algorithms used a rectangular smoothing kernel with the bandwidth equal to the distance of the $k$th nearest neighbour of the test point $x_0$ in the training set. In comparison, our formulation uses a smoothing kernel that allows the formulation to down-weight points in a smooth fashion. The vector $\hat{w}^{reg}_*$ that is learnt by solving the optimization problem (1) is used for classification at $x_0$ only. Hence, unlike linear SVMs, LLSVMs are still non-linear, as the linear fits are only local, and the smoothing kernel precisely determines the locality at each $x_0$. To see a simple example of the influence of the smoothing kernel, consider LLSVMs with the hinge loss. Standard primal-dual calculations yield
$$\hat{w}^{reg}_* = \frac{1}{\lambda}\sum_{i=1}^{n} \alpha_i y_i K(x_i, x_0, \sigma)\, x_i, \qquad 0 \le \alpha_i \le 1/n, \qquad (3)$$
where the $\alpha_i$ are the dual variables.
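Per test point, problem (1) is an ordinary unconstrained convex problem: a hinge-loss linear classifier with per-example weights $K(x_i, x_0, \sigma)$. The following is a minimal numerical sketch, not the paper's implementation: it uses the Epanechnikov smoothing kernel and plain subgradient descent with a $1/(\lambda t)$ step size, and all function names, constants, and toy data are our own illustrative choices:

```python
import numpy as np

def epanechnikov(X, x0, sigma):
    """Finite-tailed smoothing kernel: zero outside the bandwidth sigma."""
    u = np.linalg.norm(X - x0, axis=-1) / sigma
    return np.where(u <= 1.0, 0.75 * (1.0 - u ** 2) / sigma, 0.0)

def llsvm_fit(X, y, x0, lam=0.1, sigma=0.5, iters=2000):
    """Subgradient descent on objective (1):
         (lam/2)||w||^2 + (1/n) sum_i L(y_i <w, x_i>) K(x_i, x0, sigma),
       with hinge loss L(t) = max(1 - t, 0) and a 1 appended to each x_i (bias)."""
    n = len(X)
    Xb = np.hstack([X, np.ones((n, 1))])     # x_bar = (x, 1), as in the paper's notation
    k = epanechnikov(X, x0, sigma)           # points outside the bandwidth get weight 0
    w = np.zeros(Xb.shape[1])
    for t in range(1, iters + 1):
        active = (y * (Xb @ w) < 1.0)        # hinge subgradient is nonzero only here
        grad = lam * w - Xb.T @ (active * k * y) / n
        w -= grad / (lam * t)                # 1/(lam*t) step, standard for strong convexity
    return w

def llsvm_predict(X, y, x0, **kw):
    """Sign of the local linear fit at x0; the fit is used at x0 only."""
    w = llsvm_fit(X, y, x0, **kw)
    return 1 if w[:-1] @ x0 + w[-1] >= 0 else -1
```

Because the smoothing kernel is finite-tailed, training points with $K(x_i, x_0, \sigma) = 0$ never enter the subgradient, which mirrors the role of the dual weights in equation (3).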
If one uses a finite-tailed smoothing kernel, such as the Epanechnikov kernel or the rectangular kernel, then the points $x_i$ which are outside the bandwidth of the kernel, i.e. for which $K(x_i, x_0, \sigma) = 0$, have no effect on $\hat{w}^{reg}_*$. Hence the resulting LLSVM ignores these points and tries to maximize the margin in the input space using points that are close to $x_0$. Mercer kernels, which arise out of the kernel map $\phi(\cdot)$, on the other hand have nothing to do with locality; instead they allow us to fit non-linear functions. If one uses a polynomial Mercer kernel of degree $p$ in conjunction with a smoothing kernel, then it is equivalent to making local degree-$p$ approximations to the boundary function. While such approximations are potentially more powerful than local linear approximations, one would require stronger conditions, such as the existence of higher order derivatives of the decision boundary, to justify local polynomial approximations. To avoid making such strong assumptions we focus on LLSVMs in this paper. The formulation of Cheng et al. [7] is similar to the optimization problem (2), but uses an unspecified weight function $\sigma(x_i, x_0)$ in place of $K(x_i, x_0, \sigma)$. While the importance of the smoothing kernel, and its interaction with Mercer kernels, has been distilled in our formulation, the importance and impact of the weight function in the formulation of Cheng et al. was not brought out clearly.

Related Work. Kernel based rules (KBRs) have been proposed as a nonparametric classification method (see chapter 10 in [10]) and are essentially a simplified version of LLSVMs. KBRs predict the label of a point $x_0$ as $\mathrm{sgn}\big(\sum_{i=1}^{n} y_i K(x_i, x_0, \sigma)\big)$. This can be seen as using equation (3) but with all the $\alpha_i$'s set to a constant. However, for LLSVMs these $\alpha$ values themselves depend on the training data, and hence results from the KBR literature do not transfer to our case.
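For contrast with LLSVMs, the KBR prediction rule involves no learning at all; a sketch with a rectangular kernel (the positive normalization constant does not affect the sign, so it is dropped; the toy data is ours):

```python
import numpy as np

def kbr_predict(X, y, x0, sigma):
    """Kernel based rule: sgn(sum_i y_i K(x_i, x0, sigma)).
    With a rectangular kernel this is an unweighted vote over the ball B(x0, sigma)."""
    inside = np.linalg.norm(X - x0, axis=1) <= sigma   # rectangular kernel support
    vote = (y * inside).sum()                          # equation (3) with constant alpha_i
    return 1 if vote >= 0 else -1

X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
y = np.array([-1, -1, 1, 1, 1])
print(kbr_predict(X, y, np.array([3.0, 3.0]), sigma=0.5))   # → 1
```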
Learning multiple local classifiers has also been done by first clustering the data and then learning a classifier in each of these clusters [11, 12], or by using a baseline classifier [13, 14] to find regions where the classifier commits errors and then learning a dedicated classifier for each of these erroneous regions. All these algorithms are different from LLSVMs, as they learn a finite mixture of local classifiers from the training data and classify the test point as per the appropriate mixture component. In contrast, LLSVMs learn a local classifier on demand for each test point. Ensemble methods also learn multiple classifiers and combine them to learn a global model. However, there the classification model is fixed and does not change from one test point to another.

Notation. $[n] \stackrel{def}{=} \{1, \ldots, n\}$. Let $B(x, \sigma)$ denote a $d$-dimensional ball of radius $\sigma$ centered around $x$. Also, let $\langle a, b \rangle \stackrel{def}{=} a^T b$. Throughout the paper we shall use $w \in \mathbb{R}^{d+1}$ to denote a vector learned by using the training data set with a 1 appended to each training point as the $(d+1)$th dimension. If $x \in \mathbb{R}^d$ then $\langle w, x \rangle \stackrel{def}{=} \langle w, \bar{x} \rangle$, where $\bar{x} \stackrel{def}{=} (x, 1) \in \mathbb{R}^{d+1}$. Denote by $f : \mathcal{X} \to \mathbb{R}$ an arbitrary measurable function. Finally, since most of our results are "pointwise results", we shall use $x_0$ to represent an arbitrary point, and all "local quantities" will be defined w.r.t. $x_0$.

3 Pointwise Consistency of LLSVMs

We now state the assumptions and our first main result.
• A0: The domain $\mathcal{X} \subset \mathbb{R}^d$ is compact, $\|x\|_2 \le M$ for all $x \in \mathcal{X}$, and the marginal distribution on $\mathcal{X}$ is absolutely continuous w.r.t. the Lebesgue measure.
• A1: Let $C^1$ denote the class of functions that are at least once differentiable on $\mathcal{X}$. We assume that $\eta(x) := P[y = 1 \,|\, x] \in C^1$, and as a result $f_B(x) \stackrel{def}{=} 2\eta(x) - 1 \in C^1$. Such smoothness assumptions (and stronger ones) have been used to study minimax rates for classification in [15].
The impact of A1 is twofold. Firstly, the minimizer of the $L$ risk is a function of $\eta$; for the hinge loss this function is $\mathrm{sgn}(2\eta(x) - 1)$. The same holds true even for "local" versions of the $L$ risk and the 0-1 risk. Since $\eta \in C^1$, one can invoke continuity arguments to guarantee a small enough radius $\sigma$ within which the minimizer of the $L$ risk is a smooth function. Hence one can restrict the search for an optimal function to $C^1$. Such local quantities are defined in Section (3.2).
• A2: $K(\cdot, \cdot, \cdot)$ is a finite-tailed smoothing kernel that satisfies $K(\cdot, \cdot, \cdot) \ge 0$ (positive kernel), $\int_{x_2 \in \mathbb{R}^d} K(x_2, x_1, \sigma)\, dx_2 = 1$ for all $x_1 \in \mathcal{X}$, vanishes for all $x \notin B(x_1, \sigma)$, and $K(x_1, x_2, \sigma) \le K_m = \Theta(\frac{1}{\sigma^d})$ for all $x_1, x_2 \in \mathbb{R}^d$. The assumptions in A2 are standard in the nonparametric estimation literature [9]. The finite-tail assumption on the smoothing kernel simplifies proofs and should be easy to relax.
• A3: For all $x_0 \in \mathcal{X}$, $\lim_{\sigma \to 0} \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)] = 0$. A3 allows us, in the limit, to approximate the minimum local $L$ risk using only linear functions, and therefore allows us to model non-linear decision boundaries via locally linear fits.
• A4: Let $H_\sigma$ be the region of intersection of a halfspace and $B(x_0, \sigma)$ such that $\frac{Vol(H_\sigma)}{Vol(B(x_0, \sigma))} \ge \frac{1}{2}$. Then a.s. w.r.t. $\mathcal{D}_X$, $\lim_{\sigma \to 0} \inf_{H_\sigma} \mathbb{E}_{x \sim \mathcal{D}_X}[K(x, x_0, \sigma) 1_{H_\sigma}] = c^0_{x_0} > 0$. A4 requires that, for small $\sigma$, the mass in $B(x_0, \sigma)$ is spread out and is not all located in a small region of $B(x_0, \sigma)$.
As a simple example, consider the setup where the marginal distribution has uniform density on $[-1, +1]$ and the kernel is the Epanechnikov kernel. Under this setting, for any $x_0 \in (-1, 1)$, we get
$$\lim_{\sigma \to 0} \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)] = \lim_{\sigma \to 0} \frac{3}{8\sigma} \int_{x_0 - \sigma}^{x_0 + \sigma} |x - x_0| \Big(1 - \big(\tfrac{x - x_0}{\sigma}\big)^2\Big)\, dx \le \lim_{\sigma \to 0} \frac{3\sigma}{16} = 0.$$
The same result applies even for $x_0 \in \{-1, +1\}$. Hence assumption A3 is satisfied. To verify the validity of A4, it is enough to see that $\lim_{\sigma \to 0} \inf_{\theta \in [0, \sigma]} \frac{3}{8\sigma} \int_{x_0 - \sigma}^{x_0 + \theta} \big(1 - (\tfrac{x - x_0}{\sigma})^2\big)\, dx = 1/4$. Hence $c^0_{x_0} = 1/4$. Finally, we shall work only with the hinge loss; whenever we refer to the $L$ risk we mean the risk due to the hinge loss. We are now in a position to state our first result regarding the pointwise consistency of LLSVMs.

Theorem 1. Given an $x_0 \in \mathcal{X}$, if assumptions A1-A4 hold, then there exists an $n_0 \in \mathbb{N}$ such that an LLSVM that solves problem (1) at $x_0$ agrees with the Bayes classifier at $x_0$, for all $n \ge n_0$ and appropriate $\sigma, \lambda > 0$ satisfying $\lambda, \sigma \to 0$ and $\frac{n\lambda^2\sigma^{4d}}{\ln^{1+\theta} n} \to \infty$ as $n \to \infty$, for some $\theta > 0$.

3.1 Discussion of Theorem 1

Theorem (1) provides conditions on $n, \lambda, \sigma$ that guarantee that the learnt LLSVM makes a Bayes consistent decision at an arbitrary test point $x_0$. As for KBRs, SVMs, and LPR, our results require $n$ to grow and $\lambda, \sigma$ to decay at certain rates, which are precisely captured by Theorem (1). In classification, one usually proves global consistency results [10, 16], which demonstrate that the 0-1 risk of the classifier converges to that of a Bayes classifier asymptotically. Such global consistency results are asymptotic in nature. In comparison, we prove that, at an arbitrary $x_0$, with a sufficiently large amount of data and appropriate parameter settings $\lambda, \sigma$ (depending on $n$), the LLSVM decision matches that of the Bayes classifier at $x_0$. For these reasons it seems inappropriate to compare consistency results for SVMs with pointwise consistency results for LLSVMs. Proving a global consistency result for LLSVMs remains an open problem, which we intend to tackle in the future.
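Returning to the Epanechnikov example used above to verify A3 and A4 (uniform density $\tfrac{1}{2}$ on $[-1, +1]$), both limits can be checked numerically. The sketch below is our own; it uses a simple trapezoidal sum, and the grid sizes are arbitrary choices:

```python
import numpy as np

def epan(u):
    """Epanechnikov profile, supported on [-1, 1]."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def trap(fx, x):
    """Trapezoidal integral of samples fx over grid x."""
    return float(np.sum((fx[1:] + fx[:-1]) * np.diff(x)) / 2.0)

def a3_term(x0, sigma, m=20001):
    """E[|x - x0| K(x, x0, sigma)] under density 1/2 on [-1, 1]; equals 3*sigma/16."""
    x = np.linspace(x0 - sigma, x0 + sigma, m)
    return trap(0.5 * np.abs(x - x0) * epan((x - x0) / sigma) / sigma, x)

def a4_inf(x0, sigma, m=4001):
    """inf over theta in [0, sigma] of E[K(x, x0, sigma) 1_{[x0-sigma, x0+theta]}];
    the infimum is attained at theta = 0 and equals 1/4, i.e. c0 = 1/4."""
    vals = []
    for theta in np.linspace(0.0, sigma, 51):
        x = np.linspace(x0 - sigma, x0 + theta, m)
        vals.append(trap(0.5 * epan((x - x0) / sigma) / sigma, x))
    return min(vals)

print(a3_term(0.0, 0.1), 3 * 0.1 / 16)   # the A3 term vanishes linearly in sigma
print(a4_inf(0.0, 0.1))                   # ≈ 0.25
```

The factor $0.5$ in both integrands is the uniform marginal density, which is why the constant in front of the integrals is $\frac{3}{8\sigma}$ rather than the bare kernel normalization $\frac{3}{4\sigma}$.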
In the case of LPR, however, pointwise properties have been investigated [9], such as how quickly the squared loss of an LPR estimator at a point $x_0$ converges to the squared loss of the true function. There one is guaranteed that as $n \to \infty$, with appropriate $\sigma$ and with any degree of the polynomial, the excess error at $x_0$ converges to 0. In contrast, as stated above, we prove that we can predict the label of $x_0$ correctly with a finite amount of data. It is also inappropriate to compare the results of LPR and LLSVMs directly, since LPR is concerned with the prediction of a real valued quantity, whereas LLSVMs are concerned with the prediction of a binary label. As we mentioned in the related work section, the proof strategy that was used to prove the consistency of kernel based rules does not work for our case. Techniques from the LPR literature [9] also cannot be used to prove the pointwise consistency of LLSVMs. This is because in LPR one is interested in the squared loss of the estimator; squared loss allows a bias-variance decomposition, and the analysis proceeds by analyzing this decomposition. In classification, however, we are concerned with the 0-1 loss, which does not allow such a decomposition.

3.2 Overview of the Proof of Theorem (1)

As the proof of Theorem (1) is quite involved, we first present an overview. Since the statement of Theorem (1) is for each point $x_0$, we will define certain local quantities and use them throughout the proof. Our proof has three main steps. We first establish the approximation properties of our function class. We then make a connection between the 0-1 risk and the $L$ risk, since the LLSVM problem works with the $L$ risk. Finally, we need a bound on the estimation error of LLSVMs, which roughly says how good the LLSVM objective is as a proxy for the expected local $L$ risk. We shall now explain these three main steps in greater detail.
We borrow some ideas from the proof of consistency of SVMs by Steinwart [16], and shall make appropriate comparisons wherever required. The first step (Lemma (3)) is to establish the local approximation properties of linear functions. In order to do so we define the regularized local $L$ risk, $R^{reg}(w) \stackrel{def}{=} \frac{\lambda}{2}\|w\|^2 + \mathbb{E}[L(y\langle w, x\rangle) K(x, x_0, \sigma)]$, and its corresponding unregularized version, called the local $L$ risk, $R(f) = \mathbb{E}[L(yf(x)) K(x, x_0, \sigma)]$. The minimizer of the regularized local $L$ risk among linear functions is denoted $w^{reg}_* = \arg\min_w R^{reg}(w)$. In Lemma (3) we prove that for small enough $\lambda, \sigma$, the minimum of the local $L$ risk among $C^1$ functions, i.e. $\inf_{f \in C^1} R(f)$, is well approximated by $R^{reg}(w^{reg}_*)$. A similar type of result, although with global quantities, was proved by Steinwart for SVMs. However, there are two main differences. Firstly, Steinwart's proof exploited the universality properties of RKHS spaces. Since we work with linear kernels, which are not universal, Steinwart's arguments do not apply here; we instead use local approximation of $C^1$ functions by linear functions, made possible by a simple use of Taylor's expansion. Secondly, while we work with $C^1$ functions, Steinwart's proof works with the space of all measurable functions. This is because his proof does not make any assumptions on the smoothness of $\eta(\cdot)$; our assumption A1 guarantees that it is enough to work with just $C^1$ functions. The second step (Lemma (4)) connects the $L$ risk with the 0-1 risk. In order to do so we define the local risk of a function $f$ as $R^{0-1}(f) = \mathbb{E}[1[yf(x) \le 0] K(x, x_0, \sigma)]$. The excess local risk of $f$ is simply $R^{0-1}(f) - \inf_f R^{0-1}(f)$, which we prove in Lemma (2) to be equal to $R^{0-1}(f) - R^{0-1}(f_B)$.
In Lemma (4) we prove that, for small enough $\sigma$, the difference between the local $L$ risk of a function $f$ and that of the function in $C^1$ with the smallest local $L$ risk is an upper bound on the excess local 0-1 risk of $f$. This result is a local version of the result first stated in [17, 18]. In the third step, via Lemmas (5)-(7), we bound the deviation of the empirical local risk, $\hat{R}(w) \stackrel{def}{=} \frac{1}{n}\sum_{i=1}^{n} L(y_i\langle w, x_i\rangle) K(x_i, x_0, \sigma)$, from the local risk, $R(w) \stackrel{def}{=} \mathbb{E}[L(y\langle w, x\rangle) K(x, x_0, \sigma)]$, for the solution of problem (1). This is done via uniform stability arguments [19]. A similar result was also used by Steinwart, albeit for global quantities. The fourth and final step puts together all these results to establish conditions for a.s. convergence of the sequence $R^{0-1}(\hat{w}^{reg}_*) \stackrel{def}{=} \mathbb{E}[1[y\langle \hat{w}^{reg}_*, x\rangle \le 0] K(x, x_0, \sigma)]$ to $\inf_{f \in C^1} R(f)$. We then use this stochastic convergence along with assumption A4 to establish Theorem (1). The proof of this final step exploits the fact that $\eta$ is a continuous function.

Lemma 2. Let $f^* = \arg\inf_f R^{0-1}(f)$. Then, $\forall x \in B(x_0, \sigma)$, $f^*(x) \ge 0 \Leftrightarrow \eta(x) \ge \frac{1}{2}$. Hence $R^{0-1}(f^*) = R^{0-1}(f_B)$.

Proof. We have $R^{0-1}(f) = \mathbb{E}_x[(\eta(x)1(f(x) < 0) + (1 - \eta(x))1(f(x) \ge 0)) K(x, x_0, \sigma)]$. Hence $R^{0-1}(f) - R^{0-1}(f^*) = \mathbb{E}_x[(2\eta(x) - 1)(1(f^*(x) \ge 0) - 1(f(x) \ge 0)) K(x, x_0, \sigma)]$. Now, by definition, the above term is non-negative for all measurable functions $f$. Hence in $B(x_0, \sigma)$ the behavior of $f^*$ is exactly the same as that of the Bayes classifier.

The above lemma tells us that even though the local risk uses a kernel function to weight the loss, the minimizer of the local 0-1 risk, in a $\sigma$-neighborhood of $x_0$, behaves like the Bayes optimal classifier.
This simple yet crucial result would not be valid if one used a kernel that could take negative values (a negative kernel): with a negative kernel it is not possible to guarantee that $f^*_{x_0,\sigma} > 0 \Leftrightarrow \eta(x)1[x \in B(x_0, \sigma)] \ge 1/2$.

Lemma 3. Under assumptions A1-A3, at any point $x_0 \in \mathbb{R}^d$, $w^{reg}_*$ satisfies
$$\lim_{\sigma \to 0}\Big[\lim_{\lambda \to 0} R^{reg}(w^{reg}_*) - \inf_{f \in C^1} R(f)\Big] = 0. \qquad (4)$$

Proof. Step 1. We begin by proving the following statement:
$$\forall \sigma > 0: \quad \lim_{\lambda \to 0} R^{reg}(w^{reg}_*) = \inf_w R(w). \qquad (5)$$
Fix a $\sigma > 0$ and let $\epsilon > 0$ be given. Since $R(\cdot)$ is a continuous convex function, it is possible to find at least one $w_{\epsilon,\sigma}$ with $\|w_{\epsilon,\sigma}\| < \infty$ such that $R(w_{\epsilon,\sigma}) \le \inf_w R(w) + \epsilon$. Since $\frac{\lambda}{2}\|w\|^2$ is continuous in $\lambda$, there exists a $\lambda(\epsilon, \sigma)$ such that for all $\lambda \le \lambda(\epsilon, \sigma)$: $\frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 \le \epsilon$. Now for any such $\lambda$ we get
$$R^{reg}(w^{reg}_*) \le R^{reg}(w_{\epsilon,\sigma}) = \frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 + \mathbb{E}[L(y\langle w_{\epsilon,\sigma}, x\rangle) K(x, x_0, \sigma)] \le 2\epsilon + \inf_w R(w). \qquad (6)$$
Since $\epsilon$ was arbitrary, equation (5) follows.

Step 2. In the second step we prove that $\lim_{\sigma \to 0}[\inf_w R(w) - \inf_{f \in C^1} R(f)] = 0$. Suppose the real valued function $g_\sigma$ is the minimizer of $R(f)$ over $f \in C^1$. By Taylor expansion we have $g_\sigma(x) = g_\sigma(x_0) + Dg_\sigma(x_0)(x - x_0) + o(\|x - x_0\|)$; let $w_\sigma$ denote the linear part of this expansion. Since $L$ is 1-Lipschitz,
$$\inf_w R(w) - \inf_{f \in C^1} R(f) \le R(w_\sigma) - R(g_\sigma) = \mathbb{E}[(L(y\langle w_\sigma, x\rangle) - L(yg_\sigma(x))) K(x, x_0, \sigma)] \le \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)] \to 0,$$
where the last step is due to A3. This completes the proof of the second part.

Lemma 4. Suppose $\eta(x_0) \neq 1/2$. Then for a sufficiently small $\sigma$, such that $\eta(x) \neq 1/2$ for any $x \in B(x_0, \sigma)$, we have $R^{0-1}(f) - \inf_f R^{0-1}(f) \le R(f) - \inf_f R(f)$.

Proof. Define $\Delta = \{x \,|\, f(x) f_B(x) < 0\}$ and $f^*_L(x) = \mathrm{sgn}(2\eta(x) - 1)$.
$$R^{0-1}(f) - R^{0-1}(f_B) = \mathbb{E}[|2\eta(x) - 1| K(x, x_0, \sigma) 1_\Delta] \stackrel{(a)}{\le} \mathbb{E}[(1 - \eta(x)L(f^*_L(x)) - (1 - \eta(x))L(-f^*_L(x))) K(x, x_0, \sigma) 1_\Delta] \stackrel{(b)}{\le} R(f) - R(f^*_L).$$
In step (a) we used the fact that for the hinge loss $|2\eta(x) - 1| \le 1 - (\eta(x)L(f^*_L(x)) + (1 - \eta(x))L(-f^*_L(x)))$, and in step (b) we used the fact that on the event $\Delta$ it is better to predict using the 0 function, whose conditional hinge risk is $L(0) = 1$, rather than predicting with $f$. Since $f^*_L$ minimizes the conditional hinge risk pointwise, $R(f^*_L) = \inf_f R(f)$, which completes the proof.

We now need the notion of uniform stability to establish the concentration result outlined in the proof overview. Roughly, uniform stability [19] bounds the difference in the loss of a learning algorithm, at any arbitrary point, due to the removal of any one point from the training dataset.

Lemma 5. The LLSVM obtained by solving the optimization problem (1) at any point $x_0$ has uniform stability $O\big(\frac{2M^2}{n\lambda\sigma^{2d}}\big)$ w.r.t. the loss function $L(y\langle \hat{w}^{reg}_*, x\rangle) K(x, x_0, \sigma)$.

Proof. Let $\hat{w}^{reg}_*, \hat{w}^{-i,reg}_*$ be the LLSVMs learned at $x_0$ using the data sets $S, S^{-i}$ respectively. For any $z = (x, y) \in \mathcal{X} \times \{-1, +1\}$ we have
$$\big(L(y\langle \hat{w}^{reg}_*, x\rangle) - L(y\langle \hat{w}^{-i,reg}_*, x\rangle)\big) K(x, x_0, \sigma) \le M K_m \|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|. \qquad (7)$$
Hence it is enough to bound $\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|$. By definition both $\hat{w}^{reg}_*$ and $\hat{w}^{-i,reg}_*$ are solutions of their respective convex optimization problems. Let
$$N(w) \stackrel{def}{=} \frac{\lambda}{2}\|w - \hat{w}^{-i,reg}_*\|^2 + \frac{1}{n}\Big\langle \sum_{j=1}^{n} dL(y_j\langle \hat{w}^{reg}_*, x_j\rangle) K(x_j, x_0, \sigma) y_j x_j - \sum_{j \neq i} dL(y_j\langle \hat{w}^{-i,reg}_*, x_j\rangle) K(x_j, x_0, \sigma) y_j x_j,\; w - \hat{w}^{-i,reg}_* \Big\rangle, \qquad (8)$$
where $dL(\cdot)$ is an element of the subgradient of $L$ at the appropriate argument. We have $N(\hat{w}^{-i,reg}_*) = 0$ and $dN(\hat{w}^{reg}_*) = 0$. Hence $\hat{w}^{reg}_*$ is an optimal solution of the minimization problem $\min_w N(w)$, and we have $N(\hat{w}^{reg}_*) \le N(\hat{w}^{-i,reg}_*) \le 0$.
We get
$$\frac{\lambda}{2}\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|^2 \stackrel{(a)}{\le} -\frac{1}{n}\big\langle dL(y_i\langle \hat{w}^{reg}_*, x_i\rangle) K(x_i, x_0, \sigma) y_i x_i,\; \hat{w}^{reg}_* - \hat{w}^{-i,reg}_* \big\rangle \le \frac{M K_m}{n}\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\|,$$
and hence
$$\|\hat{w}^{reg}_* - \hat{w}^{-i,reg}_*\| \le \frac{2 M K_m}{n\lambda}, \qquad (9)$$
where the inequality in step (a) uses properties of convex functions. Using equations (7), (9) we get $\big(L(y\langle \hat{w}^{reg}_*, x\rangle) - L(y\langle \hat{w}^{-i,reg}_*, x\rangle)\big) K(x, x_0, \sigma) \le O\big(\frac{2M^2}{n\lambda\sigma^{2d}}\big)$.

Lemma 6. [19] Let $A_S$ be the hypothesis learnt by an algorithm $A$ on dataset $S$, such that $0 \le L(A_S, z) \le M_1$. Suppose $A$ has uniform stability $\beta$ w.r.t. $L(\cdot)$. Then, $\forall n \ge 1, \delta \in (0, 1)$, we have
$$P[R - R_{emp} \ge 2\beta + \epsilon] \le \exp\big(-2n\epsilon^2/(4n\beta + M_1)^2\big). \qquad (10)$$

Lemma 7. For any point $x_0 \in \mathbb{R}^d$ we have
$$P\Big[R(\hat{w}^{reg}_*) - \hat{R}(\hat{w}^{reg}_*) \ge \frac{4M^2}{n\lambda\sigma^{2d}} + \epsilon\Big] \le \exp\Big(-\frac{2n\lambda^2\sigma^{4d}\epsilon^2}{(8M^2 + \lambda\sigma^d + M\sqrt{\lambda\sigma^d})^2}\Big). \qquad (11)$$

Proof. The desired result follows from Lemmas (5) and (6), by substituting $\hat{R}(\hat{w}^{reg}_*)$ for $R_{emp}$ and $R(\hat{w}^{reg}_*)$ for $R$ in equation (10), and by substituting $M_1 = O(\frac{1}{\sigma^d}) + O(\frac{M}{\sigma^d\sqrt{\lambda\sigma^d}})$, which is obtained by using the fact that the hinge loss is 1-Lipschitz.

Proof of Theorem (1). The proof is in two parts. In the first part we prove that, under the conditions stated in the premise of the theorem, $R^{0-1}(\hat{w}^{reg}_*) \to R^{0-1}(f_B)$ a.s. The second part then uses this almost sure convergence of the local risk to guarantee that $\hat{w}^{reg}_*$ and $f_B$ agree on the label of $x_0$. Fix any $\epsilon > 0$. Let
$$\delta^{(1)}_{n,\lambda,\sigma} \stackrel{def}{=} \exp\Bigg(-\frac{\epsilon^2 n\sigma^{2d}}{2\big(1 + M\sqrt{\tfrac{2}{\lambda}\mathbb{E}K(x, x_0, \sigma)}\big)^2}\Bigg), \qquad \delta^{(2)}_{n,\lambda,\sigma} \stackrel{def}{=} \exp\Bigg(-\frac{2n\lambda^2\sigma^{4d}\epsilon^2}{(8M^2 + \lambda\sigma^d + M\sqrt{\lambda\sigma^d})^2}\Bigg),$$
and define $\delta_{n,\lambda,\sigma} \stackrel{def}{=} \delta^{(1)}_{n,\lambda,\sigma} + \delta^{(2)}_{n,\lambda,\sigma}$.
For appropriately chosen values of $\sigma(\epsilon), \lambda(\sigma(\epsilon))$ we have, with probability at least $1 - \delta_{n,\lambda,\sigma}$,
$$R^{reg}(\hat{w}^{reg}_*) = \frac{\lambda}{2}\|\hat{w}^{reg}_*\|^2 + R(\hat{w}^{reg}_*) \stackrel{(a)}{\le} \frac{\lambda}{2}\|\hat{w}^{reg}_*\|^2 + \hat{R}(\hat{w}^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + \epsilon \stackrel{(b)}{\le} \frac{\lambda}{2}\|w^{reg}_*\|^2 + \hat{R}(w^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + \epsilon \stackrel{(c)}{\le} \frac{\lambda}{2}\|w^{reg}_*\|^2 + R(w^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + 2\epsilon = R^{reg}(w^{reg}_*) + \frac{4M^2}{n\lambda\sigma^{2d}} + 2\epsilon \stackrel{(d)}{\le} \inf_{f \in C^1} R(f) + \frac{4M^2}{n\lambda\sigma^{2d}} + 4\epsilon + \frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 + \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)]. \qquad (12)$$
In the above equations, step (a) follows from Lemma (7), and hence incurs a failure probability of at most $\delta^{(2)}_{n,\lambda,\sigma}$; step (b) follows from the fact that $\hat{w}^{reg}_*$ is the minimizer of the empirical regularized objective; and step (c) uses Hoeffding's inequality and incurs a failure probability of $\delta^{(1)}_{n,\lambda,\sigma}$. Choosing small enough $\sigma(\epsilon), \lambda(\sigma(\epsilon), \epsilon)$, inequality (d) follows from Lemma (3). Applying Lemma (4), we get, with probability at least $1 - \delta_{n,\lambda,\sigma}$,
$$R^{0-1}(\hat{w}^{reg}_*) - \inf_f R^{0-1}(f) \le R(\hat{w}^{reg}_*) - \inf_{f \in C^1} R(f) \le R^{reg}(\hat{w}^{reg}_*) - \inf_{f \in C^1} R(f) \stackrel{(a)}{\le} \frac{4M^2}{n\lambda\sigma^{2d}} + 4\epsilon + \frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 + \mathbb{E}[o(\|x - x_0\|) K(x, x_0, \sigma)].$$
Step (a) follows from equation (12) and the fact that the marginal distribution on $\mathcal{X}$ is absolutely continuous; the absolute continuity guarantees that $\frac{\lambda}{2}\|w_{\epsilon,\sigma}\|^2 \to 0$. If $n \to \infty$, $\lambda \to 0$, $\sigma \to 0$, and $n\lambda^2\sigma^{4d} \to \infty$, we conclude that $R^{0-1}(\hat{w}^{reg}_*) \to \inf_f R^{0-1}(f) = R^{0-1}(f_B)$ in probability. For data-dependent choices of $\lambda, \sigma$ that satisfy $\lambda, \sigma \to 0$ and $\frac{n\lambda^2\sigma^{4d}}{\log^{1+\theta} n} \to \infty$, we get $\sum_{n=1}^{\infty}\delta_{n,\lambda,\sigma} < \infty$; hence, by the Borel-Cantelli lemma, the convergence $R^{0-1}(\hat{w}^{reg}_*) \to \inf_f R^{0-1}(f) = R^{0-1}(f_B)$ also happens almost surely.

Figure 1: All the points in this ball, of radius $\sigma$, centered around $x_0$, are labeled +1 by the Bayes classifier.
The region of intersection between the hyperplane and the ball, which contains the center $x_0$, is misclassified by the hyperplane; the volume of this region is at least half of the volume of the ball.

We shall now prove the second part. If $\eta(x_0) = 1/2$, then the prediction of the LLSVM at $x_0$ is irrelevant. Hence let $\eta(x_0) > 1/2$; the proof is the same if $\eta(x_0) < 1/2$. Choose $\sigma_1$ such that $\inf_{x \in B(x_0, \sigma_1)} 2\eta(x) - 1 \ge \frac{2\eta(x_0) - 1}{2}$. Notice that, because of continuity, $2\eta(x) - 1$ has the same sign everywhere in $B(x_0, \sigma_1)$ (see Figure (1)). From A4 we are guaranteed that there exists $\sigma_2 > 0$ such that for all $0 < \sigma \le \sigma_2$ we have $\inf_{H_\sigma} \mathbb{E}[K(x, x_0, \sigma) 1_{H_\sigma}] \ge \frac{c^0_{x_0}}{2}$. Let $0 < \sigma_0 \le \min\{\sigma_1, \sigma_2\}$. Now, from the first part of the proof we know that $R^{0-1}(\hat{w}^{reg}_*) \to R^{0-1}(f_B)$ almost surely. This guarantees that there exists a sufficiently large $n_0$ such that, for appropriate $\sigma \le \sigma_0$ and an appropriate choice of $\lambda$, we get
$$P\big[R^{0-1}(\hat{w}^{reg}_*) - R^{0-1}(f_B) \le c^0_{x_0}|2\eta(x_0) - 1|/8\big] = 1. \qquad (13)$$
Now, for the above choice of $n_0, \lambda, \sigma$, denote by $\Delta$ the region of disagreement between $\hat{w}^{reg}_*$ and $f_B$, and assume that $x_0 \in \Delta$. Since $2\eta(x) - 1$ has the same sign everywhere in $B(x_0, \sigma)$, we get $\Delta = \{x \in B(x_0, \sigma) \,|\, \langle \hat{w}^{reg}_*, x\rangle \le 0\}$, and hence the volume of $\Delta$ is at least half of that of $B(x_0, \sigma)$. Hence
$$R^{0-1}(\hat{w}^{reg}_*) - R^{0-1}(f_B) = \mathbb{E}[|2\eta(x) - 1| K(x, x_0, \sigma) 1_\Delta] \ge \frac{2\eta(x_0) - 1}{2}\mathbb{E}[K(x, x_0, \sigma) 1_\Delta] \ge \frac{2\eta(x_0) - 1}{2}\inf_{H_\sigma}\mathbb{E}[K(x, x_0, \sigma) 1_{H_\sigma}] \ge \frac{(2\eta(x_0) - 1)c^0_{x_0}}{4}, \qquad (14)$$
which contradicts equation (13). Hence $f_B$ and $\hat{w}^{reg}_*$ agree on the label of $x_0$.

4 Risk Bounds and Rates of Convergence to the Stochastic Objective

LLSVMs solve a local optimization problem that can be seen as minimizing an empirical version of the stochastic objective $R^{reg}(w)$.
It is then natural to ask how quickly the value of the stochastic objective at $w = \hat{w}^*_{reg}$ converges to the minimum of the stochastic objective. In Theorem 8 we demonstrate, via stability arguments, that for an arbitrary test point $x_0$ this convergence happens at the rate $\tilde{O}(1/(\sqrt{n}\,\lambda\sigma^{2d}))$. In Theorem 9 we establish generalization bounds, in terms of the empirical error of LLSVMs, for a global classifier learned by solving an LLSVM at any randomly drawn point $x$. Due to lack of space the proofs are postponed to the supplement.

Theorem 8. With probability at least $1 - \delta$ over the random input training set, we have

$$
R_{reg}(\hat{w}^*_{reg}) - R_{reg}(w^*_{reg}) \le \tilde{O}\Big(\frac{1}{\sqrt{n}\,\lambda\sigma^{2d}}\Big). \tag{15}
$$

Discussion of Theorem 8. It may be possible to improve the dependence on $n$ from $1/\sqrt{n}$ to $1/n$ via the peeling idea [20]. Based on [20], we conjecture that the dependence on $\lambda$ is optimal, while the dependence on $\sigma$ may be improvable from $1/\sigma^{2d}$ to $1/\sigma^d$.

Theorem 9. Let $\hat{w}^*_{reg}(x)$ be the vector obtained by solving the LLSVM problem, with parameters $\lambda, \sigma$, at a randomly drawn point $x$. With probability at least $1 - \delta$ over the random sample, we have

$$
\mathbb{E}\, L(y \langle \hat{w}^*_{reg}(x), x \rangle) \le \frac{1}{n} \sum_{i=1}^n L(y_i \langle \hat{w}^*_{reg}(x_i), x_i \rangle) + \frac{4M^2}{n\lambda\sigma^d} + \Big(1 + O\big(M\sqrt{1/(\lambda\sigma^d)}\big)\Big) \sqrt{\frac{\ln(1/\delta)}{2n}}.
$$

Discussion of Theorem 9. Without any further noise assumptions, the dependence on $n, \lambda, \sigma$ is optimal. Under Tsybakov's noise assumption [2], it is possible to improve the dependence on $n$. The exponential dependence on $d$ is expected, and is typical of nonparametric methods.

5 Proofs of Theorems 8 and 9

For convenience we begin with a risk bound from [19]. This risk bound relies on the notion of uniform stability.
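The leave-one-out intuition behind uniform stability can be probed numerically on the LLSVM objective itself. The sketch below is illustrative, not the paper's experiment: it assumes a Gaussian kernel for $K$, replaces the hinge with a smoothed 1-Lipschitz variant so that plain gradient descent solves the $\lambda$-strongly convex problem essentially exactly, and compares the change in the learned weight vector after deleting one training point against the closed-form bound $2MK(x_i, x_0, \sigma)/(n\lambda)$ that the stability analysis below derives for convex 1-Lipschitz losses.

```python
import numpy as np

def smoothed_hinge_grad(t, gamma=0.5):
    # Derivative of a 1-Lipschitz smoothed hinge:
    # L(t) = 1 - t - gamma/2 for t <= 1-gamma, (1-t)^2/(2*gamma) for 1-gamma < t < 1, 0 for t >= 1.
    return np.where(t <= 1.0 - gamma, -1.0, np.where(t < 1.0, -(1.0 - t) / gamma, 0.0))

def llsvm_fit(X, y, K, lam, n_norm, steps=20000, lr=0.05):
    # Gradient descent on (lam/2)||w||^2 + (1/n_norm) sum_j K_j L(y_j <w, x_j>).
    # n_norm stays fixed at the full sample size even when a point is removed,
    # matching the leave-one-out objective used in the stability analysis.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = smoothed_hinge_grad(y * (X @ w))
        w -= lr * (lam * w + (K * g * y) @ X / n_norm)
    return w

rng = np.random.default_rng(2)
n, lam, sigma = 60, 1.0, 1.0
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))
x0 = np.zeros(2)
K = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * sigma ** 2))  # assumed Gaussian kernel
M = np.max(np.linalg.norm(X, axis=1))                            # bound on ||x_j||

w_full = llsvm_fit(X, y, K, lam, n_norm=n)
i = 0
mask = np.arange(n) != i                                         # delete training point i
w_loo = llsvm_fit(X[mask], y[mask], K[mask], lam, n_norm=n)

diff = np.linalg.norm(w_full - w_loo)
bound = 2.0 * M * K[i] / (n * lam)
print(diff <= bound)                                             # True
```

In practice the observed change is far below the bound, since the bound only uses the Lipschitz constant and the kernel weight of the deleted point.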
For any learning algorithm $A$ that learns a function $A_S$ after training on the dataset $S$, uniform stability quantifies the maximum absolute change in the loss suffered by the algorithm, at any point in the space, when an arbitrary point $x_i$ is removed from the training dataset. The precise definition is as follows.

Definition 1 ([19]). An algorithm $A$ has uniform stability $\beta$ w.r.t. the loss function $L$ if

$$
\forall S,\ \forall i \in \{1, \ldots, n\}: \quad \|L(A_S, \cdot) - L(A_{S^{-i}}, \cdot)\|_\infty \le \beta. \tag{16}
$$

Lemma 10. Let $A$ be an algorithm with uniform stability $\beta$ w.r.t. a loss function $0 \le L(A_S, (x, y)) \le M_1$ for all $z := (x, y)$ and all sets $S$. Then for any $n \ge 1$ and any $\delta \in (0, 1)$, the following bound holds with probability at least $1 - \delta$ over the random draw of the sample $S$:

$$
\mathbb{E}_{z \sim D}\, L(A_S, z) \le \frac{1}{n} \sum_{i=1}^n L(A_S, z_i) + 2\beta + (4n\beta + M_1) \sqrt{\frac{\log(1/\delta)}{2n}}. \tag{17}
$$

Theorem 8. With probability at least $1 - \delta$ over the random input training set, we have

$$
R_{reg}(\hat{w}^*_{reg}) - R_{reg}(w^*_{reg}) \le 2\beta + (4n\beta + M_1) \sqrt{\frac{\log(1/\delta)}{2n}}, \tag{18}
$$

where

$$
\beta \le \frac{2M K_m}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{j=1}^n K(x_j, x_0, \sigma)} + M K_m \Bigg], \tag{19}
$$

$$
M_1 \le \frac{L(0)}{n} \sum_{j=1}^n K(x_j, x_0, \sigma) + L(0) K_m + K_m M \sqrt{\frac{2L(0)}{n\lambda} \sum_{j=1}^n K(x_j, x_0, \sigma)}. \tag{20}
$$

Proof. The proof is via stability arguments. Let $z := (x, y)$. Consider the loss function

$$
q(w, z) = \frac{\lambda}{2}\|w\|^2 + L(y \langle w, x \rangle) K(x, x_0, \sigma) - \Big[ \frac{\lambda}{2}\|w^*_{reg}\|^2 + L(y \langle w^*_{reg}, x \rangle) K(x, x_0, \sigma) \Big]. \tag{21}
$$

It is enough to bound the stability of LLSVMs w.r.t. this loss function, and to upper bound the loss itself. To bound the stability of LLSVMs it suffices to upper bound, for all $S$ and $(x, y)$, the quantity $|q(\hat{w}^*_{reg}, (x, y)) - q(\hat{w}^{-i,*}_{reg}, (x, y))|$, where $S^{-i}$ is the dataset obtained from $S$ by deleting the point $(x_i, y_i)$, and $\hat{w}^{-i,*}_{reg}$ is the LLSVM solution learned at $x_0$ from $S^{-i}$.
From Equation (21) it is clear that $q(w, z)$ is $\lambda$-strongly convex in $w$ in the $L_2$ norm. Hence, by strong convexity,

$$
q(\hat{w}^*_{reg}, z) \ge q(\hat{w}^{-i,*}_{reg}, z) + (\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg})^T \partial q(\hat{w}^{-i,*}_{reg}, z) + \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2. \tag{22}
$$

Similarly, we have

$$
q(\hat{w}^{-i,*}_{reg}, z) \ge q(\hat{w}^*_{reg}, z) + (\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^*_{reg}, z) + \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2. \tag{23}
$$

From equations (22) and (23) we get

$$
(\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^*_{reg}, z) + \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2
\le q(\hat{w}^{-i,*}_{reg}, z) - q(\hat{w}^*_{reg}, z)
\le (\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^{-i,*}_{reg}, z) - \frac{\lambda}{2} \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2. \tag{24}
$$

We shall now upper bound the rightmost term and lower bound the leftmost term; doing so will enable us to bound the stability. Differentiating Equation (21) w.r.t. $w$ we get

$$
\partial q(\hat{w}^{-i,*}_{reg}, z) = \lambda \hat{w}^{-i,*}_{reg} + \partial L(y \langle \hat{w}^{-i,*}_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x, \tag{25}
$$

$$
\partial q(\hat{w}^*_{reg}, z) = \lambda \hat{w}^*_{reg} + \partial L(y \langle \hat{w}^*_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x. \tag{26}
$$

Now, to bound the rightmost term of equation (24), we use equation (25) to get

$$
(\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \partial q(\hat{w}^{-i,*}_{reg}, z) - \frac{\lambda}{2} \|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\|^2
\le (\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg})^T \big[ \lambda \hat{w}^{-i,*}_{reg} + \partial L(y \langle \hat{w}^{-i,*}_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x \big]
\le \|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\| \cdot \big\|\lambda \hat{w}^{-i,*}_{reg} + \partial L(y \langle \hat{w}^{-i,*}_{reg}, x \rangle)\, y\, K(x, x_0, \sigma)\, x\big\|, \tag{27}
$$

where the last inequality follows from the Cauchy–Schwarz inequality. We begin by bounding $\|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\|$.
Now, by the first-order optimality conditions defining $\hat{w}^*_{reg}$ and $\hat{w}^{-i,*}_{reg}$, we get

$$
\lambda \hat{w}^*_{reg} + \frac{1}{n} \sum_{j=1}^n \partial L(y_j \langle \hat{w}^*_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j = 0, \tag{28}
$$

$$
\lambda \hat{w}^{-i,*}_{reg} + \frac{1}{n} \sum_{\substack{j=1 \\ j \ne i}}^n \partial L(y_j \langle \hat{w}^{-i,*}_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j = 0. \tag{29}
$$

Now consider the following convex optimization problem:

$$
N(w) = \frac{\lambda}{2}\|w - \hat{w}^{-i,*}_{reg}\|^2 + \frac{1}{n} \Big\langle \sum_{j=1}^n \partial L(y_j \langle \hat{w}^*_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j - \sum_{\substack{j=1 \\ j \ne i}}^n \partial L(y_j \langle \hat{w}^{-i,*}_{reg}, x_j \rangle)\, K(x_j, x_0, \sigma)\, y_j x_j,\; w - \hat{w}^{-i,*}_{reg} \Big\rangle. \tag{30}
$$

Using equations (28) and (29) it is straightforward to verify that $\frac{\partial N(\hat{w}^*_{reg})}{\partial w} = 0$, and hence, from convex analysis, $\hat{w}^*_{reg}$ is the optimal solution of the convex optimization problem $N(w)$. Also $N(\hat{w}^*_{reg}) \le N(\hat{w}^{-i,*}_{reg}) = 0$. Hence we get

$$
\frac{\lambda}{2}\|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|^2
\le -\frac{1}{n} \sum_{\substack{j=1 \\ j \ne i}}^n \big\langle \partial L(y_j \langle \hat{w}^*_{reg}, x_j \rangle)\, y_j K(x_j, x_0, \sigma)\, x_j - \partial L(y_j \langle \hat{w}^{-i,*}_{reg}, x_j \rangle)\, y_j K(x_j, x_0, \sigma)\, x_j,\; \hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg} \big\rangle - \frac{1}{n} \big\langle \partial L(y_i \langle \hat{w}^*_{reg}, x_i \rangle)\, y_i K(x_i, x_0, \sigma)\, x_i,\; \hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg} \big\rangle
$$
$$
\le -\frac{1}{n} \big\langle \partial L(y_i \langle \hat{w}^*_{reg}, x_i \rangle)\, y_i K(x_i, x_0, \sigma)\, x_i,\; \hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg} \big\rangle
\le \frac{1}{n}\, M\, K(x_i, x_0, \sigma)\, \|\hat{w}^*_{reg} - \hat{w}^{-i,*}_{reg}\|, \tag{31}
$$

where the second inequality uses the convexity of $L(\cdot)$, which gives $(\partial L(b) - \partial L(a))(b - a) \ge 0$, and the last inequality follows from Cauchy–Schwarz and the fact that $L(\cdot)$ is $1$-Lipschitz. Hence we get

$$
\|\hat{w}^{-i,*}_{reg} - \hat{w}^*_{reg}\| \le \frac{2}{n\lambda}\, M\, K(x_i, x_0, \sigma).
$$
(32)

Finally, by the optimality of $\hat{w}^{-i,*}_{reg}$ and $\hat{w}^*_{reg}$, we have

$$
\frac{\lambda}{2}\|\hat{w}^{-i,*}_{reg}\|^2 \le \frac{1}{n} \sum_{\substack{j=1 \\ j \ne i}}^n L(0)\, K(x_j, x_0, \sigma), \tag{33}
$$

$$
\frac{\lambda}{2}\|\hat{w}^*_{reg}\|^2 \le \frac{1}{n} \sum_{j=1}^n L(0)\, K(x_j, x_0, \sigma). \tag{34}
$$

Using equations (24), (25), (27), (32), and (33) we get

$$
q(\hat{w}^{-i,*}_{reg}, z) - q(\hat{w}^*_{reg}, z) \le \frac{2M K(x_i, x_0, \sigma)}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{\substack{j=1 \\ j \ne i}}^n K(x_j, x_0, \sigma)} + M K(x, x_0, \sigma) \Bigg]. \tag{35}
$$

One can use similar techniques to lower bound the leftmost term in Equation (24) to get

$$
q(\hat{w}^{-i,*}_{reg}, z) - q(\hat{w}^*_{reg}, z) \ge -\frac{2M K(x_i, x_0, \sigma)}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{j=1}^n K(x_j, x_0, \sigma)} + M K(x, x_0, \sigma) \Bigg]. \tag{36}
$$

Using the fact that $\beta = \sup_{S,z} |q(\hat{w}^*_{reg}, z) - q(\hat{w}^{-i,*}_{reg}, z)|$ and equations (35) and (36), we get

$$
\beta \le \frac{2M K_m}{n\lambda} \Bigg[ \sqrt{\frac{2\lambda L(0)}{n}} \sqrt{\sum_{j=1}^n K(x_j, x_0, \sigma)} + M K_m \Bigg]. \tag{37}
$$

In order to apply Lemma 10 it is enough to upper bound $q(\hat{w}^*_{reg}, z)$. We have

$$
q(\hat{w}^*_{reg}, z) \le \frac{\lambda}{2}\|\hat{w}^*_{reg}\|^2 + L(y \langle \hat{w}^*_{reg}, x \rangle) K(x, x_0, \sigma)
\le \frac{L(0)}{n} \sum_{j=1}^n K(x_j, x_0, \sigma) + L(y \langle \hat{w}^*_{reg}, x \rangle) K(x, x_0, \sigma)
\le \frac{L(0)}{n} \sum_{j=1}^n K(x_j, x_0, \sigma) + L(0) K_m + K_m M \sqrt{\frac{2L(0)}{n\lambda} \sum_{j=1}^n K(x_j, x_0, \sigma)}. \tag{38}
$$

Applying Lemma 10 to LLSVMs with the loss function $q(A_S, z)$, and using the fact that $\hat{R}_{reg}(\hat{w}^*_{reg}) \le \hat{R}_{reg}(w^*_{reg})$, we get the desired result.

Theorem 9. Let $\hat{w}^*_{reg}(x)$ be the solution obtained by solving the LLSVM problem at $x$. With probability at least $1 - \delta$ over the random sample for an LLSVM, we have

$$
\mathbb{E}\, L(y \langle \hat{w}^*_{reg}(x), x \rangle) \le \frac{1}{n} \sum_{i=1}^n L(y_i \langle \hat{w}^*_{reg}(x_i), x_i \rangle) + \frac{4M^2}{n\lambda\sigma^d} + \Big(1 + O\big(M \sqrt{1/(\lambda\sigma^d)}\big)\Big) \sqrt{\frac{\ln(1/\delta)}{2n}}.
$$

Proof. By Lemma 10 we are done if we can upper bound the loss suffered by LLSVMs at any point, and the stability of LLSVMs w.r.t. the loss $L(y \langle \hat{w}^*_{reg}(x), x \rangle)$.
We have $|L(y \langle \hat{w}^*_{reg}(x), x \rangle) - L(y \langle \hat{w}^{-i,*}_{reg}(x), x \rangle)| = O(2M^2/(n\lambda\sigma^d))$, where we used the upper bound on $\|\hat{w}^*_{reg}(x) - \hat{w}^{-i,*}_{reg}(x)\|$ given in Equation 9 of Lemma 5 in the main paper. Finally, $L(y \langle \hat{w}^*_{reg}(x), x \rangle) \le 1 + M \|\hat{w}^*_{reg}(x)\| \le 1 + O(M \sqrt{1/(\lambda\sigma^d)})$. Applying Lemma 10 with $\beta = O(2M^2/(n\lambda\sigma^d))$ and $M_1 = 1 + O(M \sqrt{1/(\lambda\sigma^d)})$ finishes the proof.

6 Discussion and Open Problems

Our results guarantee that the decision of an LLSVM learned at $x_0$ matches that of the Bayes classifier after it has seen enough data. An important open problem is to establish global Bayes consistency of LLSVMs; it is not clear to us whether the pointwise consistency result can be used to do so. Theorem 9 currently does not exploit our large-margin formulation; a natural extension would be a result that depends on some local notion of margin. Our current results depend on the dimensionality of the ambient space. It should be possible, under appropriate manifold assumptions [21, 22], to improve this dependency to the intrinsic dimension.

References

[1] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, July 2003.
[2] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: P&S, 9:323–375, 2005.
[3] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[4] H. Zhang, A. C. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2126–2136. IEEE, 2006.
[5] E. Blanzieri and F. Melgani. An adaptive SVM nearest neighbor classifier for remotely sensed imagery. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 3931–3934. IEEE, 2006.
[6] N. Segata and E. Blanzieri. Fast and scalable local kernel machines. Journal of Machine Learning Research, 2010.
[7] H. Cheng, P. N. Tan, and R. Jin. Efficient algorithm for localized support vector machine. IEEE Transactions on Knowledge and Data Engineering, pages 537–549, 2009.
[8] A. Gray and A. W. Moore. N-body problems in statistical learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.
[9] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, 2009.
[10] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[11] T. K. Kim and J. Kittler. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):318–327, 2005.
[12] S. Yan, H. Zhang, Y. Hu, B. Zhang, and Q. Cheng. Discriminant analysis on embedded manifold. In Computer Vision – ECCV 2004, pages 121–132, 2004.
[13] J. Dai, S. Yan, X. Tang, and J. T. Kwok. Locally adaptive classification piloted by uncertainty. In ICML, pages 225–232. ACM, 2006.
[14] O. Dekel and O. Shamir. There's a hole in my data space: Piecewise predictors for heterogeneous learning problems. In AISTATS, 2012.
[15] Y. Yang. Minimax nonparametric classification. I. Rates of convergence. IEEE Transactions on Information Theory, 45(7):2271–2284, 1999.
[16] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.
[17] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 2004.
[18] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473):138–156, 2006.
[19] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[20] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, pages 1545–1552, 2008.
[21] A. Ozakin and A. Gray. Submanifold density estimation. In Advances in Neural Information Processing Systems 22, 2009.
[22] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems, 2009.
