Qualitative Robustness of Support Vector Machines

Robert Hable and Andreas Christmann
Department of Mathematics, University of Bayreuth

Abstract

Support vector machines have attracted much attention in theoretical and in applied statistics. Main topics of recent interest are consistency, learning rates and robustness. In this article, it is shown that support vector machines are qualitatively robust. Since support vector machines can be represented by a functional on the set of all probability measures, qualitative robustness is proven by showing that this functional is continuous with respect to the topology generated by weak convergence of probability measures. Combined with the existence and uniqueness of support vector machines, our results show that support vector machines are the solutions of a well-posed mathematical problem in Hadamard's sense.

2000 AMS classification numbers: 62G08, 62G35

KEYWORDS: nonparametric regression, classification, machine learning, support vector machines, qualitative robustness

1 A Long Introduction

Two of the most important topics in statistics are classification and regression. There, it is assumed that the outcome $y \in \mathcal{Y}$ of a random variable $Y$ (output variable) is influenced by an observed value $x \in \mathcal{X}$ (input variable). On the basis of a finite data set $(x_1, y_1), \ldots, (x_n, y_n) \in (\mathcal{X} \times \mathcal{Y})^n$, the goal is to find an "optimal" predictor $f: \mathcal{X} \to \mathcal{Y}$ which makes a prediction $f(x)$ for an unobserved $y$.

In parametric statistics, a signal-plus-noise relationship $y = f_\theta(x) + \varepsilon$ is often assumed, where $f_\theta$ is precisely known except for a finite parameter $\theta \in \mathbb{R}^p$ and $\varepsilon$ is an error term (generated from a normal distribution). In this way, the goal of estimating an "optimal" predictor (which can be any function $f: \mathcal{X} \to \mathcal{Y}$) reduces to the much simpler task of estimating the parameter $\theta \in \mathbb{R}^p$. Since, in many applications, such strong assumptions can hardly be justified, nonparametric regression has been developed, which avoids (or at least considerably weakens) such assumptions. In statistical machine learning, the method of support vector machines has been developed as a method of nonparametric regression; see, e.g., Vapnik (1998), Schölkopf and Smola (2002), and Steinwart and Christmann (2008). There, the estimate of the predictor (called the empirical SVM) is a function $f$ which solves the minimization problem

$$\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(x_i, y_i, f(x_i)\bigr) + \lambda \|f\|_H^2, \qquad (1)$$

where $H$ is a certain function space. The first term in (1) is the empirical mean of the losses caused by the predictions $f(x_i)$, the second term penalizes the complexity of $f$ in order to avoid overfitting, $\lambda$ is a positive real number, and the space $H$ is a reproducing kernel Hilbert space (RKHS) which consists of functions $f: \mathcal{X} \to \mathbb{R}$.
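To make problem (1) concrete, the following minimal numerical sketch (ours, not part of the paper) may help: by the representer theorem, the minimizer of (1) lies in the span of the canonical feature maps $k(\cdot, x_i)$ of the data points, so (1) reduces to a finite-dimensional convex problem. The kernel, loss, and data below are illustrative choices; Huber's loss is convex and uniformly Lipschitz, as required later.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix; bounded and continuous, as assumed below."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def huber(r, delta=1.0):
    """Huber's loss: convex and uniformly Lipschitz in its last argument."""
    return np.where(np.abs(r) <= delta, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

K = rbf_kernel(X, X)
lam = 0.1  # fixed regularization parameter lambda > 0

def objective(alpha):
    # empirical risk + lambda * ||f||_H^2 with f = sum_j alpha_j k(., x_j);
    # by the reproducing property, ||f||_H^2 = alpha' K alpha
    return huber(y - K @ alpha).mean() + lam * alpha @ K @ alpha

alpha_hat = minimize(objective, np.zeros(len(y)), method="L-BFGS-B").x

def f_svm(X_new):
    """The fitted empirical SVM evaluated at new inputs."""
    return rbf_kernel(np.atleast_2d(X_new), X) @ alpha_hat
```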
Since the rise of robust statistics (Tukey (1960), Huber (1964)), it is well known that imperceptibly small deviations of the real world from model assumptions may lead to arbitrarily wrong conclusions. While many practitioners are aware of the need for robust methods in classical parametric statistics, it is quite often overlooked that robustness is also a crucial issue in nonparametric statistics. For example, the sample mean can be seen as a nonparametric procedure which is non-robust since it is extremely sensitive to outliers: let $X_1, \ldots, X_n$ be i.i.d. random variables with unknown distribution $\mathrm{P}$, and assume the task is to estimate the expectation of $\mathrm{P}$. If the observed data are really generated by the ideal $\mathrm{P}$ (and if expectation and variance of $\mathrm{P}$ exist), then the sample mean is the optimal estimator. However, it frequently happens in the real world that, due to outliers or small model violations, the observed data are not generated by the ideal $\mathrm{P}$ but by another distribution $\mathrm{P}'$. Even if $\mathrm{P}'$ is close to the ideal $\mathrm{P}$, the sample mean may lead to disastrous results. Detailed descriptions and some examples of such effects are given, e.g., in Tukey (1960), Huber (1964), and Huber (1981, §1.1).

In nonparametric regression, similar effects can occur. There, it is often assumed that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. random variables with unknown distribution $\mathrm{P}$. This distribution $\mathrm{P}$ determines in which way the output variable $Y_i$ is influenced by the input variable $X_i$. However, estimating a predictor $f: \mathcal{X} \to \mathcal{Y}$ can be severely distorted if the observed data $(x_1, y_1), \ldots, (x_n, y_n)$ are, as is common in practice, not generated by $\mathrm{P}$ but by another distribution $\mathrm{P}'$ which may be close to the ideal $\mathrm{P}$. In order to safeguard against severe distortions, an estimator $S_n$ should fulfill some kind of continuity: if the real distribution $\mathrm{P}'$ is close to the ideal distribution $\mathrm{P}$, then the distribution of the estimator $S_n$ should hardly be affected (uniformly in the sample sizes $n \in \mathbb{N}$). This kind of robustness is called qualitative robustness and has been formalized in Hampel (1968, 1971) for estimators taking values in $\mathbb{R}^p$.

In order to study this notion of robust statistics for support vector machines, we need a generalization of this formalization given by Cuevas (1988) because, here, the values of the estimator are functions $f: \mathcal{X} \to \mathcal{Y}$ which are elements of a (typically infinite-dimensional) Hilbert space $H$. In the case of support vector machines, the estimators $S_n: (\mathcal{X} \times \mathcal{Y})^n \to H$ can be represented by a functional $S: \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H$ on the set $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ of all probability measures on $\mathcal{X} \times \mathcal{Y}$:

$$S_n\bigl((x_1, y_1), \ldots, (x_n, y_n)\bigr) = S\Bigl(\frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}\Bigr)$$

for every $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, where $\frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ is the empirical measure and $\delta_{(x_i, y_i)}$ denotes the Dirac measure in $(x_i, y_i)$. It is shown by Cuevas (1988) that, in such cases, the qualitative robustness of a sequence of estimators $(S_n)_{n \in \mathbb{N}}$ follows from the continuity of the functional $S$ (with respect to the topology of weak convergence of probability measures).
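The representation above says that the estimator depends on the sample only through its empirical measure. The following sketch (ours; all names and data are illustrative) implements the functional $S$ for discrete measures and checks that a sample containing a duplicated observation and the corresponding empirical measure with merged atoms yield the same fitted function, up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def huber(r, delta=1.0):
    return np.where(np.abs(r) <= delta, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

def S(x_atoms, y_atoms, weights, lam=0.1):
    """SVM functional S(P) for a discrete measure P = sum_i weights[i] * delta_{(x_i, y_i)}."""
    K = rbf(x_atoms, x_atoms)
    obj = lambda a: weights @ huber(y_atoms - K @ a) + lam * a @ K @ a
    a = minimize(obj, np.zeros(len(weights)), method="L-BFGS-B").x
    return lambda X_new: rbf(np.atleast_2d(X_new), x_atoms) @ a

# A sample in which the observation (0.0, 0.5) occurs twice ...
x_full = np.array([[0.0], [0.0], [1.0], [2.0]])
y_full = np.array([0.5, 0.5, -0.2, 1.0])
f_sample = S(x_full, y_full, np.full(4, 1 / 4))

# ... and its empirical measure with the duplicate atoms merged (weight 2/4).
x_atoms = np.array([[0.0], [1.0], [2.0]])
y_atoms = np.array([0.5, -0.2, 1.0])
f_measure = S(x_atoms, y_atoms, np.array([2 / 4, 1 / 4, 1 / 4]))

grid = np.linspace(-1, 3, 9)[:, None]
print(np.allclose(f_sample(grid), f_measure(grid), atol=1e-3))  # True, up to optimizer tolerance
```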
While quantitative robustness of support vector machines has already been investigated by means of Hampel's influence functions and bounds for the maxbias in Christmann and Steinwart (2007) and by means of Bouligand influence functions in Christmann and Van Messem (2008), results about qualitative robustness of support vector machines have not been published so far. The goal of this paper is to fill this gap in the research on qualitative robustness of support vector machines.

The structure of the article is as follows: In the following Section 2, we recall the basic setup concerning support vector machines, define the functional $S$ which represents the SVM-estimators $S_n$, $n \in \mathbb{N}$, and quote the mathematical definition of qualitative robustness. In Section 3, we show that the functional $S$ of support vector machines is, in fact, continuous under very mild assumptions (Theorem 3.2). In this way, it is also proven that, under the same assumptions, support vector machines are qualitatively robust (Theorem 3.1). In addition, it follows that empirical support vector machines are continuous in the data, i.e., they are hardly affected by slight changes in the data (Corollary 3.4). Under somewhat different assumptions, this has already been shown in Steinwart and Christmann (2008, Lemma 5.13). Section 4 contains some concluding remarks. All proofs are given in the Appendix.

It has to be pointed out that our results show that support vector machines are qualitatively robust with a fixed regularization parameter $\lambda \in (0, \infty)$. If the fixed regularization parameter $\lambda$ is replaced by a sequence of parameters $\lambda_n \in (0, \infty)$ which decreases to 0 with increasing sample size $n$, then support vector machines are not qualitatively robust any more under extremely mild conditions. This is demonstrated in Section 5.2 in the Appendix. From our point of view, this is an important result as all universal consistency proofs we know of for support vector machines or for their risks use an appropriate null sequence $\lambda_n \in (0, \infty)$, $n \in \mathbb{N}$.

2 Support Vector Machines and Qualitative Robustness

Let $(\Omega, \mathcal{A}, \mathrm{Q})$ be a probability space, let $\mathcal{X}$ be a Polish space with Borel $\sigma$-algebra $\mathcal{B}(\mathcal{X})$ and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$ with Borel $\sigma$-algebra $\mathcal{B}(\mathcal{Y})$. The Borel $\sigma$-algebra of $\mathcal{X} \times \mathcal{Y}$ is denoted by $\mathcal{B}(\mathcal{X} \times \mathcal{Y})$, and the set of all probability measures on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}(\mathcal{X} \times \mathcal{Y}))$ is denoted by $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$. Let

$$X_1, \ldots, X_n: (\Omega, \mathcal{A}, \mathrm{Q}) \to (\mathcal{X}, \mathcal{B}(\mathcal{X})) \quad \text{and} \quad Y_1, \ldots, Y_n: (\Omega, \mathcal{A}, \mathrm{Q}) \to (\mathcal{Y}, \mathcal{B}(\mathcal{Y}))$$

be random variables such that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are independent and identically distributed according to some unknown probability measure $\mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$.

A measurable map $L: \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ is called a loss function. It is assumed that $L(x, y, y) = 0$ for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$; that is, the loss is zero if the prediction $f(x)$ equals the observed value $y$. In addition, we will assume that

$$L(x, y, \cdot): \mathbb{R} \to [0, \infty), \quad t \mapsto L(x, y, t)$$

is convex for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and that the following uniform Lipschitz property is fulfilled for a positive real number $|L|_1 \in (0, \infty)$:

$$\sup_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \bigl| L(x, y, t) - L(x, y, t') \bigr| \leq |L|_1 \cdot |t - t'| \qquad \forall\, t, t' \in \mathbb{R}. \qquad (2)$$

We restrict our attention to Lipschitz continuous loss functions because the use of loss functions which are not Lipschitz continuous (such as the least squares loss on unbounded domains) usually conflicts with several notions of robustness; see, e.g., Steinwart and Christmann (2008, §10.4). The risk of a measurable function $f: \mathcal{X} \to \mathbb{R}$ is defined by

$$\mathcal{R}_{L, \mathrm{P}}(f) = \int_{\mathcal{X} \times \mathcal{Y}} L\bigl(x, y, f(x)\bigr) \, \mathrm{P}\bigl(d(x, y)\bigr).$$

Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a bounded and continuous kernel with reproducing kernel Hilbert space (RKHS) $H$. See, e.g., Schölkopf and Smola (2002) or Steinwart and Christmann (2008) for details about these concepts. Note that $H$ is a Polish space since every Hilbert space is complete and, according to Steinwart and Christmann (2008, Lemma 4.29), $H$ is separable. Furthermore, every $f \in H$ is a bounded and continuous function $f: \mathcal{X} \to \mathbb{R}$; see Steinwart and Christmann (2008, Lemma 4.28).
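For distance-based losses, the uniform Lipschitz property (2) holds with a constant that can be read off the slopes of the loss, e.g., $|L|_1 = \max(\tau, 1 - \tau)$ for the pinball loss and $|L|_1 = 1$ for the $\varepsilon$-insensitive loss. A quick sketch (ours, with illustrative names) estimating the constant numerically on a grid:

```python
import numpy as np

# Distance-based losses L(x, y, t) = psi(y - t); both are convex and Lipschitz in t.
def pinball(r, tau=0.7):                 # quantile regression
    return np.where(r >= 0, tau * r, (tau - 1) * r)

def eps_insensitive(r, eps=0.5):         # SVM regression
    return np.maximum(np.abs(r) - eps, 0.0)

def lipschitz_constant(psi, t_grid, y=0.0):
    """Max slope |psi(y - t) - psi(y - t')| / |t - t'| over all grid pairs."""
    vals = psi(y - t_grid)
    num = np.abs(vals[:, None] - vals[None, :])
    den = np.abs(t_grid[:, None] - t_grid[None, :])
    mask = den > 0
    return (num[mask] / den[mask]).max()

t = np.linspace(-5, 5, 401)
print(lipschitz_constant(pinball, t))          # ~0.7 = max(tau, 1 - tau)
print(lipschitz_constant(eps_insensitive, t))  # ~1.0
```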
Since every $f \in H$ is continuous, it is in particular measurable, and its regularized risk is defined to be

$$\mathcal{R}_{L, \mathrm{P}, \lambda}(f) = \mathcal{R}_{L, \mathrm{P}}(f) + \lambda \|f\|_H^2.$$

An element $f \in H$ is called a support vector machine and denoted by $f_{L, \mathrm{P}, \lambda}$ if it minimizes the regularized risk in $H$; that is,

$$\mathcal{R}_{L, \mathrm{P}}(f_{L, \mathrm{P}, \lambda}) + \lambda \|f_{L, \mathrm{P}, \lambda}\|_H^2 = \inf_{f \in H} \; \mathcal{R}_{L, \mathrm{P}}(f) + \lambda \|f\|_H^2.$$

We would like to consider a functional

$$S: \mathrm{P} \mapsto f_{L, \mathrm{P}, \lambda}. \qquad (3)$$

However, support vector machines $f_{L, \mathrm{P}, \lambda}$ need not exist for every probability measure $\mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and, therefore, $S$ cannot be defined on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ in this way. A sufficient condition for the existence of a support vector machine based on a bounded kernel $k$ is, for example, $\mathcal{R}_{L, \mathrm{P}}(0) < \infty$; see Steinwart and Christmann (2008, Corollary 5.3). In order to enlarge the applicability of support vector machines, the following extension has been developed in Christmann et al. (2009). Following an idea already used by Huber (1967) for M-estimates in parametric models, a shifted loss function $L^*: \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is defined by

$$L^*(x, y, t) = L(x, y, t) - L(x, y, 0) \qquad \forall\, (x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}.$$

Then, similar to the original loss function $L$, define the $L^*$-risk by

$$\mathcal{R}_{L^*, \mathrm{P}}(f) = \int L^*\bigl(x, y, f(x)\bigr) \, \mathrm{P}\bigl(d(x, y)\bigr)$$

and the regularized $L^*$-risk by $\mathcal{R}_{L^*, \mathrm{P}, \lambda}(f) = \mathcal{R}_{L^*, \mathrm{P}}(f) + \lambda \|f\|_H^2$ for every $f \in H$. In complete analogy to $f_{L, \mathrm{P}, \lambda}$, we define the support vector machine based on the shifted loss function $L^*$ by

$$f_{L^*, \mathrm{P}, \lambda} = \operatorname*{arg\,inf}_{f \in H} \; \mathcal{R}_{L^*, \mathrm{P}}(f) + \lambda \|f\|_H^2.$$

The following theorem summarizes some basic results derived by Christmann et al. (2009):

Theorem 2.1 For any $\mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$, there exists a unique $f_{L^*, \mathrm{P}, \lambda} \in H$ which minimizes $\mathcal{R}_{L^*, \mathrm{P}, \lambda}$, i.e.,

$$\mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, \mathrm{P}, \lambda}) + \lambda \|f_{L^*, \mathrm{P}, \lambda}\|_H^2 = \inf_{f \in H} \; \mathcal{R}_{L^*, \mathrm{P}}(f) + \lambda \|f\|_H^2.$$

If a support vector machine $f_{L, \mathrm{P}, \lambda} \in H$ exists (which minimizes $\mathcal{R}_{L, \mathrm{P}, \lambda}$ in $H$), then $f_{L^*, \mathrm{P}, \lambda} = f_{L, \mathrm{P}, \lambda}$.
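The point of the shifted loss is that $|L^*(x, y, t)| = |L(x, y, t) - L(x, y, 0)| \leq |L|_1 |t|$, so the $L^*$-risk is finite for every $\mathrm{P}$, even when $\mathcal{R}_{L, \mathrm{P}}(f) = \infty$. A small Monte Carlo sketch (ours, with illustrative numbers) using the absolute-distance loss and Cauchy-distributed outputs, which have no finite first moment:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_cauchy(10**6)       # heavy-tailed P: E|Y - t| is infinite

t = 2.0                              # a candidate prediction f(x) = t
L      = np.abs(y - t)               # absolute-distance loss
L_star = np.abs(y - t) - np.abs(y)   # shifted loss; bounded by |t| pointwise

print(L.mean())       # unstable across runs: no finite expectation exists
print(L_star.mean())  # stabilizes: the L*-risk is finite
```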
According to this theorem, the map $S: \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H$, $\mathrm{P} \mapsto f_{L^*, \mathrm{P}, \lambda}$ exists, is uniquely defined and extends the functional in (3). Therefore, $S$ may be called the SVM-functional. In order to estimate a measurable map $f: \mathcal{X} \to \mathbb{R}$ which minimizes the risk $\mathcal{R}_{L, \mathrm{P}}(f) = \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, \mathrm{P}(d(x, y))$, the SVM-estimator is defined by

$$S_n: (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L, D_n, \lambda},$$

where $f_{L, D_n, \lambda}$ is that function $f \in H$ which minimizes

$$\frac{1}{n} \sum_{i=1}^{n} L\bigl(x_i, y_i, f(x_i)\bigr) + \lambda \|f\|_H^2$$

in $H$ for $D_n = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$. Let $\mathrm{P}_{D_n}$ be the empirical measure corresponding to the data $D_n$ for sample size $n \in \mathbb{N}$. Then, the definitions given above yield

$$f_{L, D_n, \lambda} = S_n(D_n) = S(\mathrm{P}_{D_n}) = f_{L, \mathrm{P}_{D_n}, \lambda}. \qquad (4)$$

Note that the support vector machine uniquely exists for every empirical measure. In particular, this also implies $f_{L, D_n, \lambda} = f_{L^*, \mathrm{P}_{D_n}, \lambda}$.

The main goal of the article is to show that, under very mild conditions, the sequence of SVM-estimators $(S_n)_{n \in \mathbb{N}}$ is qualitatively robust. According to Cuevas (1988, Definition 1), the sequence $(S_n)_{n \in \mathbb{N}}$ is called qualitatively robust if the functions

$$\mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to \mathcal{M}_1(H), \quad \mathrm{P} \mapsto S_n(\mathrm{P}^n), \qquad n \in \mathbb{N},$$

are uniformly continuous with respect to the weak topologies on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and $\mathcal{M}_1(H)$. Here, $\mathcal{M}_1(H)$ denotes the set of all probability measures on $(H, \mathcal{B}(H))$, $\mathcal{B}(H)$ is the Borel $\sigma$-algebra on $H$, and $S_n(\mathrm{P}^n)$ denotes the image measure of $\mathrm{P}^n$ with respect to $S_n$. Hence, $S_n(\mathrm{P}^n)$ is the measure on $(H, \mathcal{B}(H))$ which is defined by

$$S_n(\mathrm{P}^n)(F) = \mathrm{P}^n\bigl( \bigl\{ D_n \in (\mathcal{X} \times \mathcal{Y})^n \bigm| S_n(D_n) \in F \bigr\} \bigr)$$

for every Borel-measurable subset $F \subset H$. Of course, this definition only makes sense if the SVM-estimators are measurable with respect to the Borel $\sigma$-algebras. This measurability is assured by Corollary 3.4 below.

[Figure 1: Sketch of the reasoning behind robustness of $S(\mathrm{P})$. Left: $\mathrm{P}$, a neighborhood of $\mathrm{P}$, and $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$. Right: $S(\mathrm{P})$, a neighborhood of $S(\mathrm{P})$, and the set of all $S(\mathrm{P})$ for $\mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$.]

Since the weak topologies on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and $\mathcal{M}_1(H)$ are metrizable by the Prokhorov metric $d_{\mathrm{Pro}}$ (see Subsection 5.1), the sequence of SVM-estimators $(S_n)_{n \in \mathbb{N}}$ is qualitatively robust if and only if for every $\mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and every $\rho > 0$ there is an $\varepsilon > 0$ such that

$$d_{\mathrm{Pro}}(\mathrm{Q}, \mathrm{P}) < \varepsilon \quad \Rightarrow \quad d_{\mathrm{Pro}}\bigl( S_n(\mathrm{Q}^n), S_n(\mathrm{P}^n) \bigr) < \rho \qquad \forall\, n \in \mathbb{N}.$$

Roughly speaking, qualitative robustness means that the SVM-estimator tolerates two kinds of errors in the data: small errors in many observations $(x_i, y_i)$ and large errors in a small fraction of the data set. These two kinds of errors have only slight effects on the distribution and, therefore, on the performance of the SVM-estimator (uniformly in the sample size). Figure 1 gives a graphical illustration of qualitative robustness.

3 Main Results

The following theorem is our main result and shows that support vector machines are qualitatively robust under mild conditions.

Theorem 3.1 Let $\mathcal{X}$ be a Polish space and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$. Let the loss function be a continuous function $L: \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ such that $L(x, y, y) = 0$ for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and

$$L(x, y, \cdot): \mathbb{R} \to [0, \infty), \quad t \mapsto L(x, y, t)$$

is convex for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Assume that the uniform Lipschitz property

$$\sup_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \bigl| L(x, y, t) - L(x, y, t') \bigr| \leq |L|_1 \cdot |t - t'| \qquad \forall\, t, t' \in \mathbb{R}$$

is fulfilled for a real number $|L|_1 \in (0, \infty)$. Furthermore, let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a bounded and continuous kernel with RKHS $H$. Then, the sequence of SVM-estimators $(S_n)_{n \in \mathbb{N}}$ is qualitatively robust.

Of course, this theorem applies to classification (e.g., $\mathcal{Y} = \{-1, 1\}$) and regression (e.g., $\mathcal{Y} = \mathbb{R}$ or $\mathcal{Y} = [0, \infty)$). In particular, note that every function $g: \mathcal{Y} \to \mathbb{R}$ is continuous if $\mathcal{Y}$ is a discrete set, e.g., $\mathcal{Y} = \{-1, 1\}$. In this case, assuming $L$ to be continuous reduces to the assumption that

$$\mathcal{X} \times \mathbb{R} \to [0, \infty), \quad (x, t) \mapsto L(x, y, t)$$

is continuous for every $y \in \mathcal{Y}$. Many of the most common loss functions are permitted in the theorem, e.g., the hinge loss and logistic loss for classification, the $\varepsilon$-insensitive loss and Huber's loss for regression, and the pinball loss for quantile regression. The least squares loss is ruled out in Theorem 3.1, which is not surprising as it is the prominent standard example of a loss function which typically conflicts with robustness if $\mathcal{X}$ and $\mathcal{Y}$ are unbounded; see, e.g., Christmann and Steinwart (2007) and Christmann and Van Messem (2008). Assuming continuity of the kernel $k$ does not seem to be very restrictive, as all of the most common kernels are continuous. Assuming $k$ to be bounded is quite natural in order to ensure good robustness properties. While the Gaussian RBF kernel is always bounded, polynomial kernels (except for the constant kernel) and the exponential kernel are bounded if and only if $\mathcal{X}$ is bounded.
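The following simulation sketch (ours; constants are illustrative, not from the paper) illustrates the tolerated errors for a fixed regularization parameter. It uses scikit-learn's SVR, whose $\varepsilon$-insensitive loss is Lipschitz; up to SVR's intercept term, its parameter C corresponds to $1/(2\lambda n)$ in (1), so fixing C here plays the role of fixing $\lambda$.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Contaminate a small fraction of the data with gross outliers.
y_bad = y.copy()
idx = rng.choice(n, size=5, replace=False)
y_bad[idx] = 100.0

grid = np.linspace(-3, 3, 300)[:, None]
svr = dict(kernel="rbf", gamma=1.0, C=1.0, epsilon=0.1)  # fixed C <-> fixed lambda

f_clean = SVR(**svr).fit(X, y).predict(grid)
f_cont  = SVR(**svr).fit(X, y_bad).predict(grid)

# The sup-norm change stays bounded and small relative to the gross errors;
# with the (non-Lipschitz) least squares loss it would scale with the outlier size.
print(np.max(np.abs(f_clean - f_cont)))
```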
In our definition of the sequence $(S_n)_{n \in \mathbb{N}}$ of SVM-estimators, the regularization parameter $\lambda$ is a fixed real number which does not change with $n$. Instead, it is also common to consider sequences of estimators

$$T_n: (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L, D_n, \lambda_n}, \qquad n \in \mathbb{N},$$

where the fixed parameter $\lambda$ is replaced by a sequence $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ with $\lim_{n \to \infty} \lambda_n = 0$. However, Theorem 3.1 cannot be generalized to $(T_n)_{n \in \mathbb{N}}$. Proposition 5.2 (in the Appendix) shows under extremely mild conditions that $(T_n)_{n \in \mathbb{N}}$ is not qualitatively robust. This is of interest because appropriately chosen null sequences $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ are used to prove universal consistency of the risk,

$$\mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, D_n, \lambda_n}) \xrightarrow{\;\mathrm{P}\;} \inf_{f \in \mathcal{F}} \mathcal{R}_{L^*, \mathrm{P}}(f),$$

and

$$f_{L^*, D_n, \lambda_n} \xrightarrow{\;\mathrm{P}\;} \operatorname*{arg\,inf}_{f \in \mathcal{F}} \mathcal{R}_{L^*, \mathrm{P}}(f)$$

for $n \to \infty$, where $\mathcal{F}$ denotes the set of all measurable functions $f: \mathcal{X} \to \mathbb{R}$. This was first shown by Steinwart (2002), Zhang (2004), and Steinwart (2005). We also refer to Bousquet and Elisseeff (2002), Bartlett et al. (2006), Christmann et al. (2009), and Steinwart and Anghel (2009).
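A companion sketch (ours, with illustrative constants) to the previous experiment: redoing it with a null sequence $\lambda_n \to 0$ (i.e., C growing under the hedged mapping $C = 1/(2\lambda n)$, ignoring SVR's intercept), the sup-norm gap between the fits on clean and contaminated data no longer stays small, in line with Proposition 5.2.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
n = 500
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
y_bad = y.copy()
y_bad[:10] = 25.0   # a 2% fraction of gross outliers

grid = np.linspace(-3, 3, 300)[:, None]
for lam in [1e-1, 1e-2, 1e-3, 1e-5]:      # lambda_n -> 0
    C = 1.0 / (2 * lam * n)               # heuristic SVR analogue of lambda in (1)
    f0 = SVR(kernel="rbf", gamma=1.0, C=C, epsilon=0.01).fit(X, y).predict(grid)
    f1 = SVR(kernel="rbf", gamma=1.0, C=C, epsilon=0.01).fit(X, y_bad).predict(grid)
    # The gap typically grows as lambda shrinks: a weakly regularized fit
    # nearly interpolates the outliers, so the contamination is not damped.
    print(lam, np.max(np.abs(f0 - f1)))
```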
The proof of Theorem 3.1 is based on the following result, which is interesting on its own.

Theorem 3.2 Under the assumptions of Theorem 3.1, the SVM-functional

$$S: \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H, \quad \mathrm{P} \mapsto f_{L^*, \mathrm{P}, \lambda}$$

is continuous with respect to the weak topology on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and the norm topology on $H$.

As a generalization of earlier results by, e.g., Zhang (2001), De Vito et al. (2004), and Steinwart (2003), Christmann et al. (2009, Theorem 7) derived a representer theorem which shows that, for every $\mathrm{P}_0 \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$, there is a bounded map $h: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ such that $f_{L^*, \mathrm{P}_0, \lambda} = -\frac{1}{2\lambda} \int h \Phi \, d\mathrm{P}_0$ and

$$\bigl\| f_{L^*, \mathrm{P}, \lambda} - f_{L^*, \mathrm{P}_0, \lambda} \bigr\|_H \leq \lambda^{-1} \Bigl\| \int h \Phi \, d\mathrm{P} - \int h \Phi \, d\mathrm{P}_0 \Bigr\|_H \qquad (5)$$

for every $\mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$. The integrals in (5) are Bochner integrals of the vector-valued function $h\Phi: \mathcal{X} \times \mathcal{Y} \to H$, $(x, y) \mapsto h(x, y)\Phi(x)$, where $\Phi$ is the canonical feature map of $k$, i.e., $\Phi(x) = k(\cdot, x)$ for all $x \in \mathcal{X}$. This offers an elegant possibility of proving Theorem 3.2 if we would accept some additional assumptions: the statement of Theorem 3.2 is true if $\int h \Phi \, d\mathrm{P}_n$ converges to $\int h \Phi \, d\mathrm{P}_0$ for every weakly convergent sequence $\mathrm{P}_n \to \mathrm{P}_0$. In the following, we show that the integrals indeed converge, under the additional assumptions that the derivative $\frac{\partial L}{\partial t}(x, y, t)$ exists and is continuous for every $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$. These assumptions are fulfilled, e.g., for the logistic loss function and Huber's loss function. In this case, it follows from Christmann et al. (2009, Theorem 7) that $h$ is continuous. Since $\Phi$ is continuous and bounded (see, e.g., Steinwart and Christmann (2008, p. 124 and Lemma 4.29)), the integrand $h\Phi: \mathcal{X} \times \mathcal{Y} \to H$ is continuous and bounded. Then, it follows from Bourbaki (2004, p. III.40) that $\int h \Phi \, d\mathrm{P}_n$ converges to $\int h \Phi \, d\mathrm{P}_0$ for every weakly convergent sequence $\mathrm{P}_n \to \mathrm{P}_0$, just as in the case of real-valued integrands; see Subsection 5.1 in the Appendix.

Unfortunately, this short proof only works under the additional assumption of a continuous partial derivative $\frac{\partial L}{\partial t}$, and this assumption rules out many loss functions used in practice, such as the hinge loss, the absolute-distance loss and the $\varepsilon$-insensitive loss for regression, and the pinball loss for quantile regression. Therefore, our proof of Theorem 3.2 (without this additional assumption) does not use the representer theorem and Bochner integrals; it is mainly based on the theory of Hilbert spaces and weak convergence of measures.

In the following, we give some corollaries of Theorem 3.2. Let $C_b(\mathcal{X})$ be the Banach space of all bounded, continuous functions $f: \mathcal{X} \to \mathbb{R}$ with norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. Since $k$ is continuous and bounded, we immediately get from Theorem 3.2 and Steinwart and Christmann (2008, Lemma 4.28):

Corollary 3.3 Under the assumptions of Theorem 3.1, the SVM-functional

$$\mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to C_b(\mathcal{X}), \quad \mathrm{P} \mapsto f_{L^*, \mathrm{P}, \lambda}$$

is continuous with respect to the weak topology on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and the norm topology on $C_b(\mathcal{X})$.

That is, $\sup_{x \in \mathcal{X}} | f_{L, \mathrm{P}', \lambda}(x) - f_{L, \mathrm{P}, \lambda}(x) |$ is small if $\mathrm{P}'$ is close to $\mathrm{P}$. The next corollary is similar to Steinwart and Christmann (2008, Lemma 5.13) but only assumes continuity instead of differentiability of $t \mapsto L(x, y, t)$. In combination with existence and uniqueness of support vector machines (see Theorem 2.1), this result shows that a support vector machine is the solution of a well-posed mathematical problem in the sense of Hadamard (1902).

Corollary 3.4 Under the assumptions of Theorem 3.1, the SVM-estimator

$$S_n: (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L, D_n, \lambda}$$

is continuous.

In particular, it follows from Corollary 3.4 that the SVM-estimator $S_n$ is measurable.

Remark 3.5 Let $d_n$ be a metric which generates the topology on $(\mathcal{X} \times \mathcal{Y})^n$, e.g., the Euclidean metric on $\mathbb{R}^{n(k+1)}$ if $\mathcal{X} \subset \mathbb{R}^k$. Then Corollary 3.4 and Steinwart and Christmann (2008, Lemma 4.28) imply the following continuity property of the SVM-estimator: for every $\varepsilon > 0$ and every data set $D_n \in (\mathcal{X} \times \mathcal{Y})^n$, there is a $\delta > 0$ such that

$$\sup_{x \in \mathcal{X}} \bigl| f_{L, D_n', \lambda}(x) - f_{L, D_n, \lambda}(x) \bigr| < \varepsilon$$

if $D_n' \in (\mathcal{X} \times \mathcal{Y})^n$ is any other data set with $n$ observations and $d_n(D_n', D_n) < \delta$.

We finish this section with a corollary about strong consistency of support vector machines which arises as a by-product of Theorem 3.2. Often, asymptotic results for support vector machines show the convergence in probability of the risk $\mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, D_n, \lambda_n})$ to the Bayes risk $\inf_{f \in \mathcal{F}} \mathcal{R}_{L^*, \mathrm{P}}(f)$ and of $f_{L^*, D_n, \lambda_n}$ to $\operatorname*{arg\,inf}_{f \in \mathcal{F}} \mathcal{R}_{L^*, \mathrm{P}}(f)$, where $\mathcal{F}$ is the set of all measurable functions $f: \mathcal{X} \to \mathbb{R}$ and $(\lambda_n)_{n \in \mathbb{N}}$ is a suitable null sequence. In contrast to that, the following corollary provides, for fixed $\lambda \in (0, \infty)$, almost sure convergence of $\mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, D_n, \lambda})$ to $\mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, \mathrm{P}, \lambda})$ and of $f_{L^*, D_n, \lambda}$ to $f_{L^*, \mathrm{P}, \lambda}$. This is an interesting fact, although the limit $\mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, \mathrm{P}, \lambda})$ will in general differ from the Bayes risk. Recall from Section 2 that the data points $(x_i, y_i)$ from the data set $D_n = ((x_1, y_1), \ldots, (x_n, y_n))$ are realizations of i.i.d. random variables

$$(X_i, Y_i): (\Omega, \mathcal{A}, \mathrm{Q}) \to \bigl( \mathcal{X} \times \mathcal{Y}, \mathcal{B}(\mathcal{X} \times \mathcal{Y}) \bigr), \qquad i \in \mathbb{N},$$

such that $(X_i, Y_i) \sim \mathrm{P}$ for all $i \in \mathbb{N}$.

Corollary 3.6 Define the random vectors $D_n := ((X_1, Y_1), \ldots, (X_n, Y_n))$ and the corresponding $H$-valued random functions

$$f_{L^*, D_n, \lambda} = \operatorname*{arg\,inf}_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} L^*\bigl(X_i, Y_i, f(X_i)\bigr) + \lambda \|f\|_H^2, \qquad n \in \mathbb{N}.$$
From the assumptions of Theorem 3.1, it follows that

(a) $\lim_{n \to \infty} \| f_{L^*, D_n, \lambda} - f_{L^*, \mathrm{P}, \lambda} \|_H = 0$ almost surely,

(b) $\lim_{n \to \infty} \sup_{x \in \mathcal{X}} | f_{L^*, D_n, \lambda}(x) - f_{L^*, \mathrm{P}, \lambda}(x) | = 0$ almost surely,

(c) $\lim_{n \to \infty} \mathcal{R}_{L^*, \mathrm{P}, \lambda}(f_{L^*, D_n, \lambda}) = \mathcal{R}_{L^*, \mathrm{P}, \lambda}(f_{L^*, \mathrm{P}, \lambda})$ almost surely,

(d) $\lim_{n \to \infty} \mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, D_n, \lambda}) = \mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, \mathrm{P}, \lambda})$ almost surely.

If the support vector machine $f_{L, \mathrm{P}, \lambda}$ exists, then assertions (a)-(d) are also valid for $L$ instead of $L^*$.

4 Conclusions

It is well known that outliers in data sets or other moderate model violations can pose a serious problem for a statistical analysis. On the one hand, practitioners can hardly guarantee that their data sets do not contain any outliers, while, on the other hand, many statistical methods are very sensitive even to small violations of the assumed statistical model. Since support vector machines play an important role in statistical machine learning, investigating their performance in the presence of moderate model violations is a crucial topic, the more so as support vector machines are frequently applied to large and complex high-dimensional data sets.

In this article, we showed that support vector machines are qualitatively robust with a fixed regularization parameter $\lambda \in (0, \infty)$; i.e., the performance of support vector machines is hardly affected by the following two kinds of errors: large errors in a small fraction of the data set and small errors in the whole data set. This not only means that these errors do not lead to large errors in the support vector machines but also that even the finite sample distribution of support vector machines is hardly affected. In contrast to that, we also showed that support vector machines are not qualitatively robust any more under extremely mild conditions if the fixed regularization parameter $\lambda$ is replaced by a sequence of parameters $\lambda_n \in (0, \infty)$ which decreases to 0 with increasing sample size $n$. From our point of view, this is an important result as all universal consistency proofs we know of for support vector machines or for their risks use an appropriate null sequence $\lambda_n \in (0, \infty)$, $n \in \mathbb{N}$.

5 Appendix

In Subsection 5.1, we briefly recall some facts about weak convergence of probability measures. In addition, we show that weak convergence of probability measures on a Polish space implies convergence of the corresponding Bochner integrals of bounded, continuous functions. Subsection 5.2 demonstrates under extremely mild conditions that the sequence of SVM-estimators cannot be qualitatively robust if the fixed regularization parameter $\lambda$ is replaced by a sequence $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ with $\lim_{n \to \infty} \lambda_n = 0$. Subsection 5.3 contains all proofs.

5.1 Weak Convergence of Probability Measures and Bochner Integrals

Let $\mathcal{Z}$ be a Polish space with Borel $\sigma$-algebra $\mathcal{B}(\mathcal{Z})$, let $d$ be a metric on $\mathcal{Z}$ which generates the topology on $\mathcal{Z}$ and let $\mathcal{M}_1(\mathcal{Z})$ be the set of all probability measures on $(\mathcal{Z}, \mathcal{B}(\mathcal{Z}))$. A sequence $(\mathrm{P}_n)_{n \in \mathbb{N}}$ of probability measures on $\mathcal{Z}$ converges to a probability measure $\mathrm{P}_0$ in the weak topology on $\mathcal{M}_1(\mathcal{Z})$ if

$$\lim_{n \to \infty} \int g \, d\mathrm{P}_n = \int g \, d\mathrm{P}_0 \qquad \forall\, g \in C_b(\mathcal{Z}),$$

where $C_b(\mathcal{Z})$ denotes the set of all bounded, continuous functions $g: \mathcal{Z} \to \mathbb{R}$; see Billingsley (1968, §1). The weak topology on $\mathcal{M}_1(\mathcal{Z})$ is metrizable by the Prokhorov metric $d_{\mathrm{Pro}}$; see, e.g., Huber (1981, §2.2).
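The defining property above can be watched numerically: by Varadarajan's theorem (used later in the proof of Corollary 3.6), empirical measures of an i.i.d. sample converge weakly almost surely, so the integrals of bounded continuous test functions converge. A tiny Monte Carlo sketch (ours; the reference integrals are themselves Monte Carlo stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)

# Bounded, continuous test functions g in C_b(R).
tests = [np.tanh, np.cos, lambda z: 1.0 / (1.0 + z ** 2)]

z_pop = rng.standard_normal(10**6)   # large-sample stand-in for integrals against P_0
for n in [10**2, 10**4, 10**6]:
    z_n = rng.standard_normal(n)     # sample whose empirical measure P_n -> P_0 weakly
    gaps = [abs(g(z_n).mean() - g(z_pop).mean()) for g in tests]
    print(n, max(gaps))              # integrals of the test functions converge
```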
The Prokhoro v metric d Pro on M 1 ( Z ) is 12 defined b y d Pro (P 1 , P 2 ) = inf ε ∈ (0 , ∞ ) P 1 ( B ) < P 2 ( B ε ) + ε ∀ B ∈ B ( Z ) where B ε = { z ∈ Z | inf z 0 ∈Z d ( z , z 0 ) < ε } . Let g : Z → R b e a contin uous and b ounded function. By definition, w e ha ve lim n →∞ R g d P n = R g d P 0 for every sequence (P n ) n ∈ N ⊂ M 1 ( Z ) whic h con verges weakly in M 1 ( Z ) to some P 0 . The follo wing theorem states that this is still v alid for Bo c hner in tegrals if g is replaced by a vector-v alued con tinuous and b ounded function Ψ : Z → H , where H is a separable Banac h space. This follows from a corresp onding statement in Bourbaki (2004, p. I I I.40) for lo cally compact spaces Z . Boundedness of Ψ means that sup z ∈Z k Ψ( z ) k H < ∞ . Theorem 5.1 L et Z b e a Polish sp ac e with Bor el- σ -algebr a B ( Z ) and let H b e a sep ar able Banach sp ac e. If Ψ : Z → H is a c ontinuous and b ounde d function, then Z Ψ d P n − → Z Ψ d P 0 ( n → ∞ ) for every se quenc e (P n ) n ∈ N ⊂ M 1 ( Z ) which c onver ges we akly in M 1 ( Z ) to some P 0 . 5.2 A Coun terexample Theorem 3.1 sho ws that, for a fixe d regularization parameter λ ∈ (0 , ∞ ) , the sequence of SVM-estimators S n : ( X × Y ) n → H , D n 7→ f L,D n ,λ , n ∈ N , is qualitativ ely robust. The following proposition shows that, under ex- tremely mild conditions, the sequence of estimators T n : ( X × Y ) n → H , D n 7→ f L,D n ,λ n , n ∈ N , c annot b e qualitatively robust if the fixed parameter λ is replaced b y a se- quence ( λ n ) n ∈ N ⊂ (0 , ∞ ) with lim n →∞ λ n = 0 . This shows that the asymp- totic results on universal c onsistency of supp ort vector machines – which consider appropriate null sequences ( λ n ) n ∈ N ⊂ (0 , ∞ ) – are in conflict with qualitativ e robustness of supp ort v ector machines using λ n . (Asymptotic results on universal c onsistency of supp ort vector mac hines can b e found, e.g., in the references listed b efore Theorem 3.2.) F or simplicity , the following prop osition fo cuses on regression b ecause it is assumed that { 0 , 1 } ⊂ Y . A similar prop osition (with a similar pro of ) can also b e given in case of binary classification where Y = {− 1 , 1 } . 13 Prop osition 5.2 L et X b e a Polish sp ac e and let Y b e a close d subset of R such that { 0 , 1 } ⊂ Y . L et k b e a b ounde d kernel with RKHS H . L et L b e a c onvex loss function such that L ( x, y , y ) = 0 for every ( x, y ) ∈ X × Y . In addition, assume that ther e ar e x 0 , x 1 ∈ X such that ∃ ˜ f ∈ H : ˜ f ( x 0 ) = 0 , ˜ f ( x 1 ) 6 = 0 (6) L ( x 1 , 1 , 0) > 0 . (7) L et ( λ n ) n ∈ N ⊂ (0 , ∞ ) b e any se quenc e such that lim n →∞ λ n = 0 . Then, the se quenc e of estimators T n : ( X × Y ) n → H , D n 7→ f L,D n ,λ n , n ∈ N , is not qualitatively r obust. 5.3 Pro ofs In order to pro ve the main theorem, i.e. Theorem 3.1, w e ha v e to pro ve Theorem 3.2 and Corollary 3.4 at first. Pro of of Theorem 3.2: Since the pro of is somewhat in volv ed, we start with a short outline. The pro of is divided into four parts. Part 1 is concerned with some imp ortan t preparations. W e hav e to sho w that ( f L ∗ , P n ,λ ) n ∈ N con verges to f L ∗ , P 0 ,λ in H if the sequence of probabilit y mea- sures (P n ) n ∈ N w eakly conv erges to the probability measure P 0 . Let us now assume that there is a subsequence ( f L ∗ , P n ` ,λ ) ` ∈ N of ( f L ∗ , P n ,λ ) n ∈ N whic h w eakly conv erges to f L ∗ , P 0 ,λ in H . 
Then, it is shown in Part 2 and Part 3 that

$$\lim_{\ell \to \infty} \mathcal{R}_{L^*, \mathrm{P}_{n_\ell}}(f_{L^*, \mathrm{P}_{n_\ell}, \lambda}) = \mathcal{R}_{L^*, \mathrm{P}_0}(f_{L^*, \mathrm{P}_0, \lambda}) \qquad (8)$$

and

$$\lim_{\ell \to \infty} \mathcal{R}_{L^*, \mathrm{P}_{n_\ell}, \lambda}(f_{L^*, \mathrm{P}_{n_\ell}, \lambda}) = \mathcal{R}_{L^*, \mathrm{P}_0, \lambda}(f_{L^*, \mathrm{P}_0, \lambda}). \qquad (9)$$

Because of

$$\|f\|_H^2 = \frac{1}{\lambda} \bigl( \mathcal{R}_{L^*, \mathrm{P}, \lambda}(f) - \mathcal{R}_{L^*, \mathrm{P}}(f) \bigr) \qquad \forall\, \mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}), \;\; \forall\, f \in H,$$

it follows from (8) and (9) that $\lim_{\ell \to \infty} \|f_{L^*, \mathrm{P}_{n_\ell}, \lambda}\|_H = \|f_{L^*, \mathrm{P}_0, \lambda}\|_H$. Since this convergence of the norms together with weak convergence in the Hilbert space $H$ implies (strong) convergence in $H$, we get that the subsequence $(f_{L^*, \mathrm{P}_{n_\ell}, \lambda})_{\ell \in \mathbb{N}}$ converges to $f_{L^*, \mathrm{P}_0, \lambda}$ in $H$. Part 4 extends this result to the whole sequence $(f_{L^*, \mathrm{P}_n, \lambda})_{n \in \mathbb{N}}$. The main difficulty in the proof is the verification of (8) in Part 3.

In order to shorten notation, define

$$L^* f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}, \quad (x, y) \mapsto L^*\bigl(x, y, f(x)\bigr) = L\bigl(x, y, f(x)\bigr) - L(x, y, 0)$$

for every measurable $f: \mathcal{X} \to \mathbb{R}$. Following, e.g., van der Vaart (1998) and Pollard (2002), we use the notation $\mathrm{P}g = \int g \, d\mathrm{P}$ for integrals of real-valued functions $g$ with respect to $\mathrm{P}$. This leads to a very efficient notation which is more intuitive here because, in the following, $\mathrm{P}$ rather acts as a linear functional on a function space than as a probability measure on a $\sigma$-algebra. By use of these notations, we may write

$$\mathrm{P} L^* f = \int L^* f \, d\mathrm{P} = \mathcal{R}_{L^*, \mathrm{P}}(f)$$

for the (shifted) risk of $f \in H$. Accordingly, the (shifted) regularized risk of $f \in H$ is $\mathcal{R}_{L^*, \mathrm{P}, \lambda}(f) = \mathcal{R}_{L^*, \mathrm{P}}(f) + \lambda \|f\|_H^2 = \mathrm{P} L^* f + \lambda \|f\|_H^2$.

Part 1: Since the loss function $L$, the shifted loss $L^*$ and the regularization parameter $\lambda \in (0, \infty)$ are fixed, we may drop them in the notation and write

$$f_\mathrm{P} := f_{L^*, \mathrm{P}, \lambda} = S(\mathrm{P}) \qquad \forall\, \mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}).$$

Recall from Theorem 2.1 that $f_{L^*, \mathrm{P}, \lambda}$ is equal to the support vector machine $f_{L, \mathrm{P}, \lambda}$ if $f_{L, \mathrm{P}, \lambda}$ exists; that is, we have $f_\mathrm{P} = f_{L, \mathrm{P}, \lambda}$ in the latter case. According to Christmann et al. (2009, (17), (16)),

$$\|f_\mathrm{P}\|_\infty \leq \frac{1}{\lambda} |L|_1 \cdot \|k\|_\infty^2 \qquad (10)$$

and

$$\|f_\mathrm{P}\|_H \leq \sqrt{\frac{1}{\lambda} |L|_1 \int |f_\mathrm{P}| \, d\mathrm{P}} \;\overset{(10)}{\leq}\; \frac{1}{\lambda} |L|_1 \cdot \|k\|_\infty \qquad (11)$$

for every $\mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$. Since the kernel $k$ is continuous and bounded, Steinwart and Christmann (2008, Lemma 4.28) yields

$$f \in C_b(\mathcal{X}) \qquad \forall\, f \in H. \qquad (12)$$

Therefore, continuity of $L$ implies continuity of $L^* f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, $(x, y) \mapsto L(x, y, f(x)) - L(x, y, 0)$ for every $f \in H$. Furthermore, the uniform Lipschitz property of $L$ implies

$$\sup_{x, y} |L^* f(x, y)| = \sup_{x, y} \bigl| L(x, y, f(x)) - L(x, y, 0) \bigr| \leq \sup_{x', x, y} \bigl| L(x, y, f(x')) - L(x, y, 0) \bigr| \leq \sup_{x'} |L|_1 \cdot |f(x') - 0| = |L|_1 \|f\|_\infty$$

for every $f \in H$. Hence, we obtain

$$L^* f \in C_b(\mathcal{X} \times \mathcal{Y}) \qquad \forall\, f \in H. \qquad (13)$$

In particular, the above calculation and (10) imply

$$\|L^* f_\mathrm{P}\|_\infty \leq \frac{1}{\lambda} |L|_1^2 \cdot \|k\|_\infty^2 \qquad \forall\, \mathrm{P} \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}). \qquad (14)$$

For the remaining parts of the proof, let $(\mathrm{P}_n)_{n \in \mathbb{N}_0} \subset \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ be any fixed sequence such that $\mathrm{P}_n \to \mathrm{P}_0$ $(n \to \infty)$ in the weak topology on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$; that is,

$$\lim_{n \to \infty} \mathrm{P}_n g = \mathrm{P}_0 g \qquad \forall\, g \in C_b(\mathcal{X} \times \mathcal{Y}). \qquad (15)$$

In particular, (13) and (15) imply

$$\lim_{n \to \infty} \mathrm{P}_n L^* f = \mathrm{P}_0 L^* f \qquad \forall\, f \in H. \qquad (16)$$

In order to shorten the notation, define $f_n := f_{\mathrm{P}_n} = f_{L^*, \mathrm{P}_n, \lambda} = S(\mathrm{P}_n)$ for all $n \in \mathbb{N} \cup \{0\}$. Hence, we have to show that $(f_n)_{n \in \mathbb{N}}$ converges to $f_0$ in $H$; that is,

$$\lim_{n \to \infty} \|f_n - f_0\|_H = 0. \qquad (17)$$

Part 2: In this part of the proof, it is shown that

$$\limsup_{n \to \infty} \; \mathrm{P}_n L^* f_n + \lambda \|f_n\|_H^2 \;\leq\; \mathrm{P}_0 L^* f_0 + \lambda \|f_0\|_H^2. \qquad (18)$$
Due to (13), the mapping $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$, $\mathrm{P} \mapsto \mathrm{P} L^* f + \lambda \|f\|_H^2$ is well defined and continuous for every $f \in H$. Being the (pointwise) infimum over a family of continuous functions, the function

$$\mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}, \quad \mathrm{P} \mapsto \inf_{f \in H} \; \mathrm{P} L^* f + \lambda \|f\|_H^2$$

is upper semicontinuous; see, e.g., Denkowski et al. (2003, Prop. 1.1.36). Therefore, the definition of $f_n$ implies

$$\limsup_{n \to \infty} \; \mathrm{P}_n L^* f_n + \lambda \|f_n\|_H^2 = \limsup_{n \to \infty} \; \inf_{f \in H} \; \mathrm{P}_n L^* f + \lambda \|f\|_H^2 \leq \inf_{f \in H} \; \mathrm{P}_0 L^* f + \lambda \|f\|_H^2 = \mathrm{P}_0 L^* f_0 + \lambda \|f_0\|_H^2.$$

Part 3: In this part of the proof, the following statement is shown: let $(f_{n_\ell})_{\ell \in \mathbb{N}}$ be a subsequence of $(f_n)_{n \in \mathbb{N}}$ and assume that $(f_{n_\ell})_{\ell \in \mathbb{N}}$ converges weakly in $H$ to some $f_0' \in H$. Then, the following three assertions are true:

$$\lim_{\ell \to \infty} \mathrm{P}_{n_\ell} L^* f_{n_\ell} = \mathrm{P}_0 L^* f_0' \qquad (19)$$

$$f_0' = f_0 \qquad (20)$$

$$\lim_{\ell \to \infty} \|f_{n_\ell} - f_0\|_H = 0. \qquad (21)$$

In order to prove this, we will also have to deal with subsequences of the subsequence $(f_{n_\ell})_{\ell \in \mathbb{N}}$. As this would lead to a somewhat cumbersome notation, we define

$$\mathrm{P}_\ell' := \mathrm{P}_{n_\ell} \quad \text{and} \quad f_\ell' := f_{n_\ell}, \qquad \ell \in \mathbb{N}.$$

Thus, $f_\ell' = f_{L^*, \mathrm{P}_{n_\ell}, \lambda}$ for every $\ell \in \mathbb{N}$. Then, the assumption of weak convergence in the Hilbert space $H$ equals

$$\lim_{\ell \to \infty} \langle f_\ell', h \rangle_H = \langle f_0', h \rangle_H \qquad \forall\, h \in H. \qquad (22)$$

First of all, we show (19) by proving

$$\limsup_{\ell \to \infty} \; \bigl| \mathrm{P}_\ell' L^* f_\ell' - \mathrm{P}_0 L^* f_0' \bigr| \leq \varepsilon_0 \qquad (23)$$

for every fixed $\varepsilon_0 > 0$. In order to do this, fix any $\varepsilon_0 > 0$ and define

$$\varepsilon := \frac{\varepsilon_0}{|L|_1 \cdot \bigl( \frac{1}{\lambda} |L|_1 \cdot \|k\|_\infty^2 + \|f_0'\|_\infty \bigr)} > 0. \qquad (24)$$

The following calculation shows that the sequence of functions $(f_\ell')_{\ell \in \mathbb{N}}$ is uniformly equicontinuous on $\mathcal{X}$: for any convergent sequence $x_m \to x_0$ in $\mathcal{X}$, we have

$$\limsup_{m \to \infty} \sup_{\ell \in \mathbb{N}} \bigl| f_\ell'(x_m) - f_\ell'(x_0) \bigr| = \limsup_{m \to \infty} \sup_{\ell \in \mathbb{N}} \bigl| \langle f_\ell', \Phi(x_m) \rangle_H - \langle f_\ell', \Phi(x_0) \rangle_H \bigr| = \limsup_{m \to \infty} \sup_{\ell \in \mathbb{N}} \bigl| \langle f_\ell', \Phi(x_m) - \Phi(x_0) \rangle_H \bigr| \leq \limsup_{m \to \infty} \sup_{\ell \in \mathbb{N}} \|f_\ell'\|_H \cdot \|\Phi(x_m) - \Phi(x_0)\|_H \overset{(11)}{\leq} \frac{1}{\lambda} |L|_1 \cdot \|k\|_\infty \cdot \limsup_{m \to \infty} \|\Phi(x_m) - \Phi(x_0)\|_H = 0,$$

where the first equality follows from the properties of the RKHS $H$ and the last equality follows from Steinwart and Christmann (2008, Lemma 4.29). Since $\mathcal{X} \times \mathcal{Y}$ is a Polish space, weak convergence of $(\mathrm{P}_\ell')_{\ell \in \mathbb{N}}$ implies uniform tightness of $(\mathrm{P}_\ell')_{\ell \in \mathbb{N}}$ (see, e.g., Dudley (1989, Theorem 11.5.3)). That is, there is a compact subset $K_\varepsilon \subset \mathcal{X} \times \mathcal{Y}$ such that

$$\limsup_{\ell \to \infty} \mathrm{P}_\ell'\bigl(K_\varepsilon^c\bigr) < \varepsilon. \qquad (25)$$

Since $K_\varepsilon$ is compact and the projection $\tau_\mathcal{X}: \mathcal{X} \times \mathcal{Y} \to \mathcal{X}$, $(x, y) \mapsto x$ is continuous, $\tilde{K}_\varepsilon := \tau_\mathcal{X}(K_\varepsilon)$ is compact in $\mathcal{X}$. For every $\ell \in \mathbb{N}_0$, the restriction of $f_\ell'$ to $\tilde{K}_\varepsilon$ is denoted by $\tilde{f}_\ell'$. As the sequence $(f_\ell')_{\ell \in \mathbb{N}}$ is uniformly equicontinuous on $\mathcal{X}$ and uniformly bounded in $C_b(\mathcal{X})$ (see (10)), the sequence of the restrictions $(\tilde{f}_\ell')_{\ell \in \mathbb{N}}$ has the corresponding properties on $\tilde{K}_\varepsilon$. That is, $(\tilde{f}_\ell')_{\ell \in \mathbb{N}}$ is uniformly equicontinuous on $\tilde{K}_\varepsilon$ and uniformly bounded in $C_b(\tilde{K}_\varepsilon)$. Hence, the Arzela-Ascoli theorem (see Conway (1985, Theorem VI.3.8)) assures that $(\tilde{f}_\ell')_{\ell \in \mathbb{N}}$ is totally bounded and, therefore, relatively compact in $C_b(\tilde{K}_\varepsilon)$ (since $C_b(\tilde{K}_\varepsilon)$ is a complete metric space); see, e.g., Dunford and Schwartz (1958, Theorem I.6.15). The following reasoning shows that $(\tilde{f}_\ell')_{\ell \in \mathbb{N}}$ converges to $\tilde{f}_0'$ in $C_b(\tilde{K}_\varepsilon)$, i.e.,

$$\lim_{\ell \to \infty} \sup_{x \in \tilde{K}_\varepsilon} \bigl| f_\ell'(x) - f_0'(x) \bigr| = 0. \qquad (26)$$

We will show (26) by contradiction.
If (26) is not true, then there is a $\delta > 0$ and a subsequence $(\tilde{f}_{\ell_j}')_{j \in \mathbb{N}}$ such that

$$\sup_{x \in \tilde{K}_\varepsilon} \bigl| f_{\ell_j}'(x) - f_0'(x) \bigr| > \delta \qquad \forall\, j \in \mathbb{N}. \qquad (27)$$

Relative compactness of $(\tilde{f}_\ell')_{\ell \in \mathbb{N}}$ implies that there is a further subsequence $(\tilde{f}_{\ell_{j_m}}')_{m \in \mathbb{N}}$ which converges in $C_b(\tilde{K}_\varepsilon)$ to some $\tilde{h}_0 \in C_b(\tilde{K}_\varepsilon)$. Then,

$$\tilde{h}_0(x) = \lim_{m \to \infty} \tilde{f}_{\ell_{j_m}}'(x) = \lim_{m \to \infty} f_{\ell_{j_m}}'(x) = \lim_{m \to \infty} \langle f_{\ell_{j_m}}', \Phi(x) \rangle_H \overset{(22)}{=} \langle f_0', \Phi(x) \rangle_H = f_0'(x) = \tilde{f}_0'(x)$$

for every $x \in \tilde{K}_\varepsilon$. That is, $\tilde{f}_0'$ is the limit of $(\tilde{f}_{\ell_{j_m}}')_{m \in \mathbb{N}}$, which is the desired contradiction to (27). Therefore, (26) is true.

Now, we can prove (23). Firstly, the triangle inequality and the Lipschitz continuity of $L$ yield

$$\limsup_{\ell \to \infty} \bigl| \mathrm{P}_\ell' L^* f_\ell' - \mathrm{P}_0 L^* f_0' \bigr| \leq \limsup_{\ell \to \infty} \Bigl( \bigl| \mathrm{P}_\ell' L^* f_\ell' - \mathrm{P}_\ell' L^* f_0' \bigr| + \bigl| \mathrm{P}_\ell' L^* f_0' - \mathrm{P}_0 L^* f_0' \bigr| \Bigr) \overset{(16)}{=} \limsup_{\ell \to \infty} \bigl| \mathrm{P}_\ell' L^* f_\ell' - \mathrm{P}_\ell' L^* f_0' \bigr| = \limsup_{\ell \to \infty} \Bigl| \int L\bigl(x, y, f_\ell'(x)\bigr) - L\bigl(x, y, f_0'(x)\bigr) \, \mathrm{P}_\ell'\bigl(d(x, y)\bigr) \Bigr| \leq \limsup_{\ell \to \infty} \int |L|_1 \cdot \bigl| f_\ell'(x) - f_0'(x) \bigr| \, \mathrm{P}_\ell'\bigl(d(x, y)\bigr) = |L|_1 \cdot \limsup_{\ell \to \infty} \biggl( \int_{K_\varepsilon} \bigl| f_\ell'(x) - f_0'(x) \bigr| \, \mathrm{P}_\ell'\bigl(d(x, y)\bigr) + \int_{K_\varepsilon^c} \bigl| f_\ell'(x) - f_0'(x) \bigr| \, \mathrm{P}_\ell'\bigl(d(x, y)\bigr) \biggr).$$

Secondly, using $\tilde{K}_\varepsilon = \tau_\mathcal{X}(K_\varepsilon)$, we obtain

$$\limsup_{\ell \to \infty} \int_{K_\varepsilon} \bigl| f_\ell'(x) - f_0'(x) \bigr| \, \mathrm{P}_\ell'\bigl(d(x, y)\bigr) \leq \limsup_{\ell \to \infty} \sup_{(x, y) \in K_\varepsilon} \bigl| f_\ell'(x) - f_0'(x) \bigr| = \limsup_{\ell \to \infty} \sup_{x \in \tilde{K}_\varepsilon} \bigl| f_\ell'(x) - f_0'(x) \bigr| \overset{(26)}{=} 0.$$

Thirdly,

$$\limsup_{\ell \to \infty} \int_{K_\varepsilon^c} \bigl| f_\ell'(x) - f_0'(x) \bigr| \, \mathrm{P}_\ell'\bigl(d(x, y)\bigr) \leq \limsup_{\ell \to \infty} \mathrm{P}_\ell'(K_\varepsilon^c) \cdot \bigl( \|f_\ell'\|_\infty + \|f_0'\|_\infty \bigr) \overset{(25)}{\leq} \limsup_{\ell \to \infty} \varepsilon \cdot \bigl( \|f_\ell'\|_\infty + \|f_0'\|_\infty \bigr) \overset{(10), (24)}{=} \frac{\varepsilon_0}{|L|_1}.$$

Combining these three calculations proves (23). Since $\varepsilon_0 > 0$ was arbitrarily chosen in (23), this proves (19).

Next, we prove (20). Due to weak convergence of $(f_{n_\ell})_{\ell \in \mathbb{N}}$ in $H$, it follows from Conway (1985, Exercise V.1.9) that

$$\|f_0'\|_H \leq \liminf_{\ell \to \infty} \|f_{n_\ell}\|_H. \qquad (28)$$

Therefore, the definition of $f_0 = f_{L^*, \mathrm{P}_0, \lambda}$ implies

$$\mathrm{P}_0 L^* f_0 + \lambda \|f_0\|_H^2 = \inf_{f \in H} \; \mathrm{P}_0 L^* f + \lambda \|f\|_H^2 \leq \mathrm{P}_0 L^* f_0' + \lambda \|f_0'\|_H^2 \overset{(19), (28)}{\leq} \liminf_{\ell \to \infty} \; \mathrm{P}_{n_\ell} L^* f_{n_\ell} + \lambda \|f_{n_\ell}\|_H^2 \leq \limsup_{\ell \to \infty} \; \mathrm{P}_{n_\ell} L^* f_{n_\ell} + \lambda \|f_{n_\ell}\|_H^2 \overset{(18)}{\leq} \mathrm{P}_0 L^* f_0 + \lambda \|f_0\|_H^2.$$

Due to this calculation, it follows that

$$\mathrm{P}_0 L^* f_0 + \lambda \|f_0\|_H^2 = \inf_{f \in H} \; \mathrm{P}_0 L^* f + \lambda \|f\|_H^2 = \mathrm{P}_0 L^* f_0' + \lambda \|f_0'\|_H^2 \qquad (29)$$

and

$$\mathrm{P}_0 L^* f_0 + \lambda \|f_0\|_H^2 = \lim_{\ell \to \infty} \; \mathrm{P}_{n_\ell} L^* f_{n_\ell} + \lambda \|f_{n_\ell}\|_H^2. \qquad (30)$$

According to Theorem 2.1, $f_0 = f_{L^*, \mathrm{P}_0, \lambda}$ is the unique minimizer of the function $H \to \mathbb{R}$, $f \mapsto \mathrm{P}_0 L^* f + \lambda \|f\|_H^2$ and, therefore, (29) implies $f_0 = f_0'$, i.e., (20).

Completing Part 3 of the proof, (21) is shown now:

$$\lim_{\ell \to \infty} \|f_{n_\ell}\|_H^2 = \lim_{\ell \to \infty} \frac{1}{\lambda} \Bigl( \bigl( \mathrm{P}_{n_\ell} L^* f_{n_\ell} + \lambda \|f_{n_\ell}\|_H^2 \bigr) - \mathrm{P}_{n_\ell} L^* f_{n_\ell} \Bigr) \overset{(19), (30)}{=} \frac{1}{\lambda} \Bigl( \bigl( \mathrm{P}_0 L^* f_0 + \lambda \|f_0\|_H^2 \bigr) - \mathrm{P}_0 L^* f_0 \Bigr) = \|f_0\|_H^2.$$

By assumption, the sequence $(f_{n_\ell})_{\ell \in \mathbb{N}}$ converges weakly to some $f_0' \in H$, and by (20), we know that $f_0' = f_0$. In addition, we have now proven $\lim_{\ell \to \infty} \|f_{n_\ell}\|_H = \|f_0\|_H$. This convergence of the norms together with weak convergence implies strong convergence in the Hilbert space $H$; see, e.g., Conway (1985, Exercise V.1.8). That is, we have proven (21).

Part 4: In this final part of the proof, (17) is shown.
This is done by contradiction: if (17) is not true, there is an $\varepsilon > 0$ and a subsequence $(f_{n_\ell})_{\ell \in \mathbb{N}}$ of $(f_n)_{n \in \mathbb{N}}$ such that

$$\|f_{n_\ell} - f_0\|_H > \varepsilon \qquad \forall\, \ell \in \mathbb{N}. \qquad (31)$$

According to (11), $(f_{n_\ell})_{\ell \in \mathbb{N}} = (f_{\mathrm{P}_{n_\ell}})_{\ell \in \mathbb{N}}$ is bounded in $H$. Hence, the sequence $(f_{n_\ell})_{\ell \in \mathbb{N}}$ contains a further subsequence that weakly converges in $H$ to some $f_0'$; see, e.g., Dunford and Schwartz (1958, Corollary IV.4.7). Without loss of generality, we may therefore assume that $(f_{n_\ell})_{\ell \in \mathbb{N}}$ weakly converges in $H$ to some $f_0'$. (Otherwise, we can choose another subsequence in (31).) Next, it follows from Part 3 that $(f_{n_\ell})_{\ell \in \mathbb{N}}$ strongly converges in $H$ to $f_0$, which is a contradiction to (31). □

Proof of Corollary 3.4: Let $(D_{n,m})_{m \in \mathbb{N}}$ be a sequence in $(\mathcal{X} \times \mathcal{Y})^n$ which converges to some $D_{n,0} \in (\mathcal{X} \times \mathcal{Y})^n$. Then, the corresponding sequence of empirical measures $(\mathrm{P}_{D_{n,m}})_{m \in \mathbb{N}}$ weakly converges in $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ to $\mathrm{P}_{D_{n,0}}$. Therefore, the statement follows from Theorem 3.2 and (4). □

Based on Cuevas (1988), the main theorem essentially is a consequence of Theorem 3.2.

Proof of Theorem 3.1: According to Corollary 3.4, the SVM-estimator $S_n: (\mathcal{X} \times \mathcal{Y})^n \to H$, $D_n \mapsto f_{L, D_n, \lambda}$ is continuous and, therefore, measurable with respect to the Borel $\sigma$-algebras for every $n \in \mathbb{N}$. The mapping $S: \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H$, $\mathrm{P} \mapsto f_{L^*, \mathrm{P}, \lambda}$ is a continuous functional due to Theorem 3.2. Furthermore, $S_n(D_n) = S(\mathrm{P}_{D_n})$ for all $D_n \in (\mathcal{X} \times \mathcal{Y})^n$ and all $n \in \mathbb{N}$. As already mentioned in Section 2, $H$ is a separable Hilbert space and, therefore, a Polish space. Hence, the sequence of SVM-estimators $(S_n)_{n \in \mathbb{N}}$ is qualitatively robust according to Cuevas (1988, Theorem 2). □

Proof of Corollary 3.6: Let $\mathrm{P}_{D_n}$ denote the function which maps $\omega \in \Omega$ to the empirical measure $\frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i(\omega), Y_i(\omega))}$. According to Varadarajan's theorem (Dudley (1989, Theorem 11.4.1)), there is a set $N \in \mathcal{A}$ such that $\mathrm{Q}(N) = 0$ and $\mathrm{P}_{D_n(\omega)}$ weakly converges to $\mathrm{P}$ for every $\omega \in \Omega \setminus N$. Then, Theorem 3.2 implies

$$\lim_{n \to \infty} \bigl\| f_{L^*, D_n(\omega), \lambda} - f_{L^*, \mathrm{P}, \lambda} \bigr\|_H \overset{(4)}{=} \lim_{n \to \infty} \bigl\| S(\mathrm{P}_{D_n(\omega)}) - S(\mathrm{P}) \bigr\|_H = 0$$

for every $\omega \in \Omega \setminus N$. This proves (a) and, due to Steinwart and Christmann (2008, Lemma 4.28), (b). The Lipschitz continuity of $L^*$ implies

$$\bigl| \mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, D_n(\omega), \lambda}) - \mathcal{R}_{L^*, \mathrm{P}}(f_{L^*, \mathrm{P}, \lambda}) \bigr| = \Bigl| \int L\bigl(x, y, f_{L^*, D_n(\omega), \lambda}(x)\bigr) - L\bigl(x, y, f_{L^*, \mathrm{P}, \lambda}(x)\bigr) \, \mathrm{P}\bigl(d(x, y)\bigr) \Bigr| \leq \int \sup_{x', y'} \bigl| L\bigl(x', y', f_{L^*, D_n(\omega), \lambda}(x)\bigr) - L\bigl(x', y', f_{L^*, \mathrm{P}, \lambda}(x)\bigr) \bigr| \, \mathrm{P}\bigl(d(x, y)\bigr) \leq \int |L|_1 \cdot \bigl| f_{L^*, D_n(\omega), \lambda}(x) - f_{L^*, \mathrm{P}, \lambda}(x) \bigr| \, \mathrm{P}\bigl(d(x, y)\bigr) \leq |L|_1 \cdot \bigl\| f_{L^*, D_n(\omega), \lambda} - f_{L^*, \mathrm{P}, \lambda} \bigr\|_\infty$$

for every $\omega \in \Omega$. According to (b), the last term converges to 0 for $\mathrm{Q}$-almost every $\omega \in \Omega$, and this implies (d). Finally, (c) follows from (a) and (d).

If $f_{L, \mathrm{P}, \lambda}$ exists, then $f_{L^*, \mathrm{P}, \lambda}$ is equal to $f_{L, \mathrm{P}, \lambda}$ (Theorem 2.1). In particular, there is an $f \in H$ such that $(x, y) \mapsto L(x, y, f(x))$ is $\mathrm{P}$-integrable. Since Lipschitz continuity of $L$ and $H \subset C_b(\mathcal{X})$ (see Steinwart and Christmann (2008, Lemma 4.28)) imply $\mathrm{P}$-integrability of $(x, y) \mapsto L^*(x, y, f(x)) = L(x, y, f(x)) - L(x, y, 0)$, we get that $(x, y) \mapsto L(x, y, 0)$ is also $\mathrm{P}$-integrable. Therefore, $\mathcal{R}_{L^*, \mathrm{P}}(f)$ is equal to $\mathcal{R}_{L, \mathrm{P}}(f) - \mathcal{R}_{L, \mathrm{P}}(0)$ for every $f \in H$, and $\mathcal{R}_{L, \mathrm{P}}(0)$ is a finite constant which does not depend on $f$. Furthermore, $f_{L^*, D_n, \lambda} = f_{L, D_n, \lambda}$ for every $D_n \in (\mathcal{X} \times \mathcal{Y})^n$; see Section 2.
Hence, the original assertions (a)-(d) for $L^*$ turn into the corresponding assertions for $L$ instead of $L^*$. □

Proof of Theorem 5.1: If $\Psi = 0$, the statement is true. Assume now $\Psi \neq 0$ and assume that the statement of the theorem is not true. Then, there is an $\varepsilon > 0$ and a subsequence $(\mathrm{P}_{n_\ell})_{\ell \in \mathbb{N}}$ such that

$$\Bigl\| \int \Psi \, d\mathrm{P}_{n_\ell} - \int \Psi \, d\mathrm{P}_0 \Bigr\|_H > \varepsilon \qquad \forall\, \ell \in \mathbb{N}. \qquad (32)$$

Since the sequence $(\mathrm{P}_n)_{n \in \mathbb{N}}$ weakly converges to $\mathrm{P}_0$, it is uniformly tight; see, e.g., Dudley (1989, Theorem 11.5.3). That is, there is a compact subset $K \subset \mathcal{Z}$ such that

$$\mathrm{P}_{n_\ell}\bigl(\mathcal{Z} \setminus K\bigr) < \frac{\varepsilon}{4 \sup_z \|\Psi(z)\|_H} \qquad \forall\, \ell \in \mathbb{N}_0. \qquad (33)$$

For every $\ell \in \mathbb{N}$, let $\tilde{\mathrm{P}}_{n_\ell}$ denote the restriction of $\mathrm{P}_{n_\ell}$ to the Borel $\sigma$-algebra $\mathcal{B}(K)$ of $K$, and let $\tilde{\Psi}$ denote the restriction of $\Psi$ to $K$. Since $K$ is a compact Polish space, the set $\mathcal{M}(K)$ of all finite signed measures on $\mathcal{B}(K)$ is the dual space of $C(K)$ (the set of all continuous functions $f: K \to \mathbb{R}$); see, e.g., Dudley (1989, Theorems 7.1.1 and 7.4.1). Accordingly, $\mathcal{M}(K)$ is precisely the set of all (real) measures in the sense of Bourbaki (2004, Section III.1); see also Bourbaki (2004, Subsections III.1.5 and III.1.8). Since $(\tilde{\mathrm{P}}_{n_\ell})_{\ell \in \mathbb{N}}$ is relatively compact in the vague topology of $\mathcal{M}(K)$ (Bourbaki, 2004, Subsection III.1.9), we may assume without loss of generality that $(\tilde{\mathrm{P}}_{n_\ell})_{\ell \in \mathbb{N}}$ vaguely converges to some positive finite measure $\tilde{\mathrm{P}}_0'$. (Otherwise, we may replace $(\tilde{\mathrm{P}}_{n_\ell})_{\ell \in \mathbb{N}}$ by a further subsequence.) According to Bourbaki (2004, p. III.40), vague convergence implies

$$\int \tilde{\Psi} \, d\tilde{\mathrm{P}}_{n_\ell} \longrightarrow \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_0' \qquad (\ell \to \infty) \qquad (34)$$

for Pettis and Bochner integrals (since $H$ is assumed to be a separable Banach space, Pettis integrals and Bochner integrals coincide; see, e.g., Dudley (1989, p. 150)).

Let $H^*$ be the dual space of $H$. Note that $F \circ \Psi$ is continuous and bounded on $\mathcal{Z}$ for every $F \in H^*$. Hence, it follows from weak convergence of $(\mathrm{P}_{n_\ell})_{\ell \in \mathbb{N}}$ to $\mathrm{P}_0$ and a property of the Bochner integral (Denkowski et al., 2003, Theorem 3.10.16) that

$$\lim_{\ell \to \infty} F\Bigl( \int \Psi \, d\mathrm{P}_{n_\ell} \Bigr) = \lim_{\ell \to \infty} \int F \circ \Psi \, d\mathrm{P}_{n_\ell} = \int F \circ \Psi \, d\mathrm{P}_0 = F\Bigl( \int \Psi \, d\mathrm{P}_0 \Bigr).$$

Accordingly, vague convergence of $(\tilde{\mathrm{P}}_{n_\ell})_{\ell \in \mathbb{N}}$ to $\tilde{\mathrm{P}}_0'$ implies $\lim_{\ell \to \infty} F\bigl( \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_{n_\ell} \bigr) = F\bigl( \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_0' \bigr)$. Hence,

$$\lim_{\ell \to \infty} F\Bigl( \int \Psi \, d\mathrm{P}_{n_\ell} - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_{n_\ell} \Bigr) = F\Bigl( \int \Psi \, d\mathrm{P}_0 - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_0' \Bigr). \qquad (35)$$

For every $\ell \in \mathbb{N}$,

$$\Bigl\| \int \Psi \, d\mathrm{P}_{n_\ell} - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_{n_\ell} \Bigr\|_H = \Bigl\| \int_{\mathcal{Z} \setminus K} \Psi \, d\mathrm{P}_{n_\ell} \Bigr\|_H \leq \int_{\mathcal{Z} \setminus K} \|\Psi\|_H \, d\mathrm{P}_{n_\ell} \overset{(33)}{\leq} \frac{\varepsilon}{4}. \qquad (36)$$

For every $\ell \in \mathbb{N}$ and every $F \in H^*$ such that $\|F\|_{H^*} \leq 1$, (36) implies $\bigl| F\bigl( \int \Psi \, d\mathrm{P}_{n_\ell} - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_{n_\ell} \bigr) \bigr| \leq \frac{\varepsilon}{4}$ and, because of (35), also $\bigl| F\bigl( \int \Psi \, d\mathrm{P}_0 - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_0' \bigr) \bigr| \leq \frac{\varepsilon}{4}$. Hence, it follows from Dunford and Schwartz (1958, Corollary II.3.15) that

$$\Bigl\| \int \Psi \, d\mathrm{P}_0 - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_0' \Bigr\|_H \leq \frac{\varepsilon}{4}. \qquad (37)$$

By using the triangle inequality, we obtain

$$\Bigl\| \int \Psi \, d\mathrm{P}_{n_\ell} - \int \Psi \, d\mathrm{P}_0 \Bigr\|_H \leq \Bigl\| \int \Psi \, d\mathrm{P}_{n_\ell} - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_{n_\ell} \Bigr\|_H + \Bigl\| \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_{n_\ell} - \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_0' \Bigr\|_H + \Bigl\| \int \tilde{\Psi} \, d\tilde{\mathrm{P}}_0' - \int \Psi \, d\mathrm{P}_0 \Bigr\|_H,$$

so that (34), (36) and (37) imply $\limsup_{\ell \to \infty} \bigl\| \int \Psi \, d\mathrm{P}_{n_\ell} - \int \Psi \, d\mathrm{P}_0 \bigr\|_H \leq \frac{\varepsilon}{2}$. This is a contradiction to (32). □

Proof of Proposition 5.2: Without loss of generality, we may assume that

$$\tilde{f}(x_0) = 0 \quad \text{and} \quad \tilde{f}(x_1) = 1. \qquad (38)$$

(Otherwise, we can divide $\tilde{f}$ by $\tilde{f}(x_1)$.) Since the function $\mathbb{R} \to [0, \infty)$, $t \mapsto L(x_1, 1, t)$ is convex, it is also continuous. Therefore, (7) implies the existence of a $\gamma \in (0, 1)$ such that

$$L(x_1, 1, \gamma) > 0. \qquad (39)$$
Note that convexity of the loss function, $L(x_1, 1, 1) = 0$ and $L(x_1, 1, \gamma) > 0$ imply

$$0 = L(x_1, 1, 1) \leq L(x_1, 1, t) < L(x_1, 1, \gamma) \leq L(x_1, 1, s) \qquad (40)$$

for $0 \leq s \leq \gamma < t \leq 1$. Define $\mathrm{P}_0 := \delta_{(x_0, 0)}$. Since $f_{L, \delta_{(x_0, 0)}, \lambda_n} = 0$, it follows that

$$\mathrm{P}_0^n\bigl( \bigl\{ D_n \in (\mathcal{X} \times \mathcal{Y})^n \bigm| f_{L, D_n, \lambda_n} = 0 \bigr\} \bigr) = 1. \qquad (41)$$

Next, fix any $\varepsilon \in (0, 1)$ and define the mixture distribution

$$\mathrm{P}_\varepsilon := (1 - \varepsilon)\mathrm{P}_0 + \varepsilon \delta_{(x_1, 1)} = (1 - \varepsilon)\delta_{(x_0, 0)} + \varepsilon \delta_{(x_1, 1)}.$$

For every $n \in \mathbb{N}$, let $Z_n'$ be the subset of $(\mathcal{X} \times \mathcal{Y})^n$ which consists of all those elements $D_n = (D_n^{(1)}, \ldots, D_n^{(n)}) \in (\mathcal{X} \times \mathcal{Y})^n$ where

$$D_n^{(i)} \in \bigl\{ (x_0, 0), (x_1, 1) \bigr\} \qquad \forall\, i \in \{1, \ldots, n\}.$$

In addition, let $Z_n''$ be the subset of $(\mathcal{X} \times \mathcal{Y})^n$ which consists of all those elements $D_n = (D_n^{(1)}, \ldots, D_n^{(n)}) \in (\mathcal{X} \times \mathcal{Y})^n$ where

$$\sharp\bigl\{ i \in \{1, \ldots, n\} \bigm| D_n^{(i)} = (x_1, 1) \bigr\} \geq \frac{n\varepsilon}{2}. \qquad (42)$$

Define $Z_n := Z_n' \cap Z_n''$. Then, we have $\mathrm{P}_\varepsilon^n(Z_n') = 1$ and, according to the law of large numbers (Dudley (1989, Theorem 8.3.5)), $\lim_{n \to \infty} \mathrm{P}_\varepsilon^n(Z_n'') = 1$. Hence, there is an $n_{\varepsilon, 1} \in \mathbb{N}$ such that

$$\mathrm{P}_\varepsilon^n(Z_n) \geq \frac{1}{2} \qquad \forall\, n \geq n_{\varepsilon, 1}. \qquad (43)$$

Due to $\lim_{n \to \infty} \lambda_n = 0$ and (39), there is an $n_{\varepsilon, 2} \in \mathbb{N}$ such that

$$\lambda_n \|\tilde{f}\|_H^2 < \frac{\varepsilon}{2} L(x_1, 1, \gamma) \qquad \forall\, n \geq n_{\varepsilon, 2}. \qquad (44)$$

In the following, we show that

$$f_{L, D_n, \lambda_n}(x_1) > \gamma \qquad \forall\, D_n \in Z_n, \;\; \forall\, n \geq n_{\varepsilon, 2}. \qquad (45)$$

To this end, fix any $D_n \in Z_n$. In order to prove (45), it is enough to show the following assertion for every $n \geq n_{\varepsilon, 2}$:

$$f \in H, \; f(x_1) \leq \gamma \quad \Rightarrow \quad \mathcal{R}_{L, D_n, \lambda_n}(\tilde{f}) \leq \mathcal{R}_{L, D_n, \lambda_n}(f). \qquad (46)$$

The definition of $Z_n$ and (38) imply $\mathcal{R}_{L, D_n, \lambda_n}(\tilde{f}) = \mathcal{R}_{L, D_n}(\tilde{f}) + \lambda_n \|\tilde{f}\|_H^2 = \lambda_n \|\tilde{f}\|_H^2$. For every $f \in H$ such that $f(x_1) \leq \gamma$, the definition of $Z_n$ implies

$$\mathcal{R}_{L, D_n, \lambda_n}(f) \geq \mathcal{R}_{L, D_n}(f) \overset{(42)}{\geq} \frac{\varepsilon}{2} L\bigl(x_1, 1, f(x_1)\bigr) \overset{(40)}{\geq} \frac{\varepsilon}{2} L(x_1, 1, \gamma).$$

Hence, (46) follows from (44) and, therefore, we have proven (45). Define $n_\varepsilon = \max\{n_{\varepsilon, 1}, n_{\varepsilon, 2}\}$. By assumption, $k$ is a bounded, non-zero kernel. According to Steinwart and Christmann (2008, Lemma 4.23), this implies

$$\|f_{L, D_n, \lambda_n}\|_H \geq \frac{\|f_{L, D_n, \lambda_n}\|_\infty}{\|k\|_\infty} \overset{(45)}{\geq} \frac{\gamma}{\|k\|_\infty} \qquad \forall\, D_n \in Z_n, \;\; \forall\, n \geq n_\varepsilon$$

and, therefore,

$$\|f_{L, D_n, \lambda_n}\|_H \geq \min\Bigl\{ \frac{\gamma}{\|k\|_\infty}, 1 \Bigr\} =: c \qquad \forall\, D_n \in Z_n, \;\; \forall\, n \geq n_\varepsilon. \qquad (47)$$

Define $F := \{ f \in H \mid \|f\|_H \geq c \}$ and

$$F^{\frac{c}{2}} := \Bigl\{ f \in H \Bigm| \inf_{f' \in F} \|f - f'\|_H \leq \frac{c}{2} \Bigr\} \subset \bigl\{ f \in H \bigm| \|f\|_H > 0 \bigr\}. \qquad (48)$$

Hence, for every $n \geq n_\varepsilon$, we obtain

$$T_n(\mathrm{P}_\varepsilon^n)(F) = \mathrm{P}_\varepsilon^n\bigl( \bigl\{ D_n \bigm| \|f_{L, D_n, \lambda_n}\|_H \geq c \bigr\} \bigr) \overset{(47)}{\geq} \mathrm{P}_\varepsilon^n(Z_n) \overset{(43)}{\geq} \frac{1}{2} \overset{(47)}{\geq} \frac{c}{2} \overset{(41)}{=} \mathrm{P}_0^n\bigl( \bigl\{ D_n \bigm| \|f_{L, D_n, \lambda_n}\|_H > 0 \bigr\} \bigr) + \frac{c}{2} = T_n(\mathrm{P}_0^n)\bigl( \bigl\{ f \in H \bigm| \|f\|_H > 0 \bigr\} \bigr) + \frac{c}{2} \overset{(48)}{\geq} T_n(\mathrm{P}_0^n)\bigl( F^{\frac{c}{2}} \bigr) + \frac{c}{2}.$$

According to the definition of the Prokhorov distance (see Subsection 5.1), it follows that

$$\sup_{n \in \mathbb{N}} \; d_{\mathrm{Pro}}\bigl( T_n(\mathrm{P}_0^n), T_n(\mathrm{P}_\varepsilon^n) \bigr) \geq \frac{c}{2}. \qquad (49)$$

In addition, we have $d_{\mathrm{Pro}}(\mathrm{P}_0, \mathrm{P}_\varepsilon) \leq \varepsilon$ because $\mathrm{P}_\varepsilon$ is an $\varepsilon$-mixture of $\mathrm{P}_0$. Since $c > 0$ does not depend on $\varepsilon \in (0, 1)$ and $\varepsilon$ may be arbitrarily small, this proves that $(T_n)_{n \in \mathbb{N}}$ is not qualitatively robust in $\mathrm{P}_0$. □

References

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.

P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, New York, 1968.

N. Bourbaki. Integration. I. Chapters 1-6. Springer-Verlag, Berlin, 2004. Translated from the 1959, 1965 and 1967 French originals by Sterling K. Berberian.

O. Bousquet and A. Elisseeff. Stability and generalization.
Journal of Machine Learning Research, 2:499-526, 2002.

A. Christmann and I. Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3):799-819, 2007.

A. Christmann and A. Van Messem. Bouligand derivatives and robustness of support vector machines for regression. Journal of Machine Learning Research, 9:915-936, 2008.

A. Christmann, A. Van Messem, and I. Steinwart. On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2:311-327, 2009.

J. B. Conway. A Course in Functional Analysis. Springer-Verlag, New York, 1985.

A. Cuevas. Qualitative robustness in abstract inference. Journal of Statistical Planning and Inference, 18:277-289, 1988.

E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, and A. Verri. Some properties of regularized kernel methods. Journal of Machine Learning Research, 5:1363-1390, 2004.

Z. Denkowski, S. Migórski, and N. Papageorgiou. An Introduction to Nonlinear Analysis: Theory. Kluwer Academic Publishers, Boston, 2003.

R. Dudley. Real Analysis and Probability. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA, 1989.

N. Dunford and J. Schwartz. Linear Operators. I. General Theory. Wiley-Interscience Publishers, New York, 1958.

J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin, 13:49-52, 1902.

F. R. Hampel. Contributions to the theory of robust estimation. PhD thesis, University of California, Berkeley, 1968.

F. R. Hampel. A general qualitative definition of robustness. Annals of Mathematical Statistics, 42:1887-1896, 1971.

P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73-101, 1964.

P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I: Statistics, pages 221-233. University of California Press, Berkeley, 1967.

P. J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.

D. Pollard. A User's Guide to Measure Theoretic Probability. Cambridge University Press, Cambridge, 2002.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.

I. Steinwart. Support vector machines are universally consistent. Journal of Complexity, 18:768-791, 2002.

I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071-1105, 2003.

I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51:128-142, 2005.

I. Steinwart and M. Anghel. Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise. Annals of Statistics, 37:841-875, 2009.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

J. Tukey. A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics, pages 448-485. Stanford University Press, Stanford, CA, 1960.

A. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998.

V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

T. Zhang.
Convergence of large margin separable linear classification. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 357-363. MIT Press, Cambridge, MA, 2001.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56-85, 2004.