Probabilistic Kernel Support Vector Machines
Authors: Yongxin Chen, Tryphon T. Georgiou, Allen R. Tannenbaum
Abstract: We propose a probabilistic enhancement of standard kernel Support Vector Machines for binary classification, in order to address the case when, along with given data sets, a description of uncertainty (e.g., error bounds) may be available on each datum. In the present paper, we specifically consider Gaussian distributions to model uncertainty. Thereby, our data consist of pairs (x_i, Σ_i), i ∈ {1, ..., N}, along with an indicator y_i ∈ {−1, 1} to declare membership in one of two categories for each pair. These pairs may be viewed as representing the mean and covariance, respectively, of random vectors ξ_i taking values in a suitable linear space (typically R^n). Thus, our setting may also be viewed as a modification of Support Vector Machines to classify distributions, albeit, at present, only Gaussian ones. We outline the formalism that allows computing suitable classifiers via a natural modification of the standard "kernel trick." The main contribution of this work is to point out a suitable kernel function for applying Support Vector techniques to the setting of uncertain data for which a detailed uncertainty description is also available (herein, "Gaussian points").

I. INTRODUCTION AND MOTIVATION

The Support Vector methodology is a nonlinear generalization of linear classification, typically utilized for binary classification. In this methodology, a training set of data becomes available where each datum belongs to one of two categories, while an indicator function identifies the category of each. Imagine points on a plane labeled with two different colors. Solely based on their location and proximity to others of particular colors, one needs to classify new points and assign them to one of the two categories.
When the points of the training set in the two categories can be separated by a line, this line can be chosen to separate two respective regions, one corresponding to each of the two categories. On the other hand, when the "training" points of the two categories cannot be separated by a line, a higher-order curve needs to be chosen. The Support Vector methodology provides a rigorous and systematic way to do exactly that. In the case where no simple boundary can be chosen to delineate regions for the points in the two categories, and a simple classification rule is still sought based on location, a smooth boundary can be selected to define regions that contain the vast majority of points in the respective categories. Thus, simplicity of the classification rule (Occam's razor) may be desirable, as the domains corresponding to the two categories may not be completely separate, and a misclassification error in that case must be deemed acceptable. This again can be handled by a suitable relaxation of Support Vector Machines (SVM). At present the rather extensive literature on the subject continues to expand unabated, and the same applies to the growing library of options with regard to formulations and related methodologies [1]–[3], as well as to applications in system identification and decision making [4, Chapter 15]. Initially, algorithms were developed assuming that samples are error-free. However, evidently, in many practical applications such an assumption is unrealistic. To this end, a number of approaches have been proposed.

Y. Chen is with the School of Aerospace Engineering, Georgia Tech, Atlanta, GA; yongchen@gatech.edu. T.T. Georgiou is with the Department of Mechanical and Aerospace Engineering, UCI, Irvine, CA; tryphon@uci.edu. A.R. Tannenbaum is with the Departments of Computer Science and Applied Mathematics & Statistics, Stony Brook University, NY; arobertan@cs.stonybrook.edu.
For a survey and other directions in developing SVMs designed to address uncertain data sets, due to sampling, modeling, or instrumentation errors, or class-label uncertainty, see [5]–[8] and the references therein. The purpose of this paper is to add to this growing literature by providing yet another angle on how one can incorporate uncertainty in data sets into SVMs. Specifically, we consider the paradigm where measurements are provided along with error bars that quantify expected margins. This is the case when uncertainty is recorded at the time data is collected (e.g., by keeping track of the sensors being used or physical conditions that hinder precision). Alternatively, a datum may be an empirical probability distribution, which in turn may be abstracted as a mean along with a covariance matrix to keep track of the spread. Either way, we assume that our data points are approximated by Gaussian distributions. Thus, in such cases, a "datum" is now a point in a space that is more complex than R^n. We seek to devise suitable kernels, and thereby extend the Support Vector methodology accordingly. The non-trivial nature of the problem of devising kernels for data points on curved metric spaces has been discussed in [9]. Yet, as we will see, by introducing a suitable probability law that couples the dataset and thereby takes advantage of the available covariance information, the case of "Gaussian data points" allows for a rather simple probabilistic kernel (Theorem 1). Thus, we postulate that measurements provide vectors x ∈ R^n together with covariance matrices Σ that quantify our uncertainty in the value x being recorded. For the purposes of binary classification, we record such pairs (x, Σ) along with information on the category of origin of each, which is given by an indicator value y = ±1.
A datum with a large variance naturally delineates a larger volume that should be associated with the corresponding category, or at least, should impact proportionately the drawing of the separation between the two. Prior state of the art does not do that. Thus, in the present note, we propose a new kernel that allows treating measurement uncertainty as part of the data.

Related work: One approach to incorporating data uncertainties in SVM is through robust optimization [5]–[7], and many related works [5] focus on linear SVM. In the robust optimization approach, uncertainties are modeled by a hard bound instead of a soft probability bound. The work that is closest to ours is [8], where the authors also defined a kernel for uncertain measurements. However, their kernel is defined over histograms, which require discretization for Gaussian measurements, and thus do not scale to high-dimensional applications.

Below, in Section II, we highlight some background on Support Vector Machines. Section III gives our main result that identifies a kernel function for the case of data provided in the form of "Gaussian points." Several numerical examples are presented in Section IV to illustrate our framework. We close with a short concluding remark in Section V.

II. BACKGROUND

Support Vector Machines (SVMs) constitute a well-established technique for (binary) classification and regression analysis [10]. The main idea is to embed a given (training) data set into a high-dimensional space H, of dimension much higher than the native dimension of the base space X and possibly infinite, so that for binary classification, the two classes can be separated with a hyperplane [11]. Effectively, this separating hyperplane projects down to the native base space X, where the data points originally reside, as curves/surfaces/etc. that separate the two classes.
The embedding into H is effected by a mapping φ : x ∈ X ↦ φ(x) ∈ H, where φ(x) is referred to as the feature vector. The space H has an inner product structure (Hilbert space) and, naturally, the construction of the separating hyperplane relies on the geometry of H. However, and most importantly, the map φ does not need to be known explicitly and does not need to be applied to the data. All necessary operations, inner products and projections that show up in the classifier and computations, can be effectively carried out in the base space using the so-called "kernel trick." Indeed, to accomplish the task and construct the classification rule (via suitable curves/surfaces/etc.), it is sufficient to know the kernel

    k(x, y) := ⟨φ(x), φ(y)⟩_H,   (1)

with x, y ∈ X. It evaluates the inner product in the feature space as a function of the corresponding representatives x, y in the native base space. Thus, the kernel is a bivariate function which is completely characterized by the property of being positive, in the sense that for all x_i ∈ X, i ∈ {1, ..., N}, and any corresponding set of values α_i ∈ R (as we are interested in real-valued kernels),

    Σ_{i,j=1}^N α_i α_j k(x_i, x_j) ≥ 0.

Necessity stems from the fact that the left-hand side above is

    ⟨Σ_{i=1}^N α_i φ(x_i), Σ_{j=1}^N α_j φ(x_j)⟩_H = ‖Σ_{i=1}^N α_i φ(x_i)‖²_H ≥ 0.

Sufficiency, in the sense of the existence of a feature map φ that realizes (1), is a celebrated theorem [12]; see also [13, Theorem 7.5.2] and [10, page 30, Theorem 3.1 (Mercer)].

Classification relies on constructing a classifier that is built on a linear functional, when viewed in H. It is of the form

    x ↦ sign(⟨w, φ(x)⟩_H − b),   (2)

and the value ±1 aims to differentiate between elements in the two categories.
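The positivity property above can be checked numerically: the Gram matrix of a positive kernel must have nonnegative eigenvalues. A minimal sketch using the RBF kernel of (5) in plain NumPy (the sample points and σ are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))   # 30 illustrative sample points in R^3
sigma = 1.0

# Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# Positivity: alpha' K alpha >= 0 for every alpha, i.e. all eigenvalues >= 0
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True (up to roundoff)
```

Since α' K α = ‖Σ_i α_i φ(x_i)‖², a strictly negative eigenvalue would contradict the existence of a feature map.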
The coefficients w ∈ H and b ∈ R are chosen so that {h ∈ H | ⟨w, h⟩_H − b = 0} is a separating hyperplane of the two subsets S_± = {φ(x_i) | y_i = ±1} of the complete (training) data set. Once again, φ(·) does not need to be evaluated at any point in the construction; the existence of such a map is enough, and it is guaranteed by the positivity of the kernel. The construction of the classifier requires selection of the parameters w = Σ_{i=1}^N c_i y_i φ(x_i) ∈ H and b ∈ R. These are chosen either ("hard margin") to

    minimize ⟨w, w⟩_H subject to y_i(⟨w, φ(x_i)⟩_H − b) ≥ 1,

or ("soft margin") to

    minimize (1/N) Σ_{i=1}^N max{0, 1 − y_i(⟨w, φ(x_i)⟩_H − b)} + λ⟨w, w⟩_H,   (3)

over all available points in the "training set." The "hard margin" formulation coincides with the limit of the "soft" formulation as λ → 0 when the two clusters are separable. The dual formulation of (3) becomes the problem of maximizing

    Σ_{i=1}^N c_i − (1/2) Σ_{i,j=1}^N y_i y_j c_i c_j k(x_i, x_j),

subject to Σ_{i=1}^N c_i y_i = 0, as well as 0 ≤ c_i ≤ (2Nλ)^{−1} for all i. The coefficients c_i can now be obtained via quadratic programming, and b can be found as

    b = Σ_{i=1}^N c_i y_i k(x_i, x_j) − y_j,

with j corresponding to an index such that 0 < c_j < (2Nλ)^{−1}. The classification rule then becomes

    x ↦ sign(Σ_{i=1}^N c_i y_i k(x_i, x) − b).   (4)

The above follows standard development; see, e.g., [14] as well as the Wikipedia page on Support Vector Machines [15].

III. CLASSIFICATION OF GAUSSIAN POINTS

The problem we are addressing in the present note is the classification of uncertain data points into one of two categories, i.e., a binary classification as before. However, a salient feature of our setting is that data are only known with finite accuracy. Uncertainty is modeled in a probabilistic manner. For simplicity, in this paper, we consider only Gaussian points.
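The dual problem above can be solved with any quadratic-programming routine. As a self-contained illustration (not the implementation used in the paper), the following sketch runs projected gradient ascent on a simplified dual that drops the bias term b, so the equality constraint Σ_i c_i y_i = 0 disappears and only the box constraints 0 ≤ c_i ≤ (2Nλ)^{−1} remain; all data and parameter values are made up for the example:

```python
import numpy as np

def rbf(XA, XB, sigma=1.0):
    """RBF kernel matrix, as in (5)."""
    sq = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma**2))

def fit_dual(K, y, lam=0.01, lr=0.01, iters=4000):
    """Projected gradient ascent on the bias-free dual:
    max sum(c) - 0.5 c' Q c,  with 0 <= c_i <= (2 N lam)^{-1} and Q_ij = y_i y_j K_ij."""
    N = len(y)
    C = 1.0 / (2 * N * lam)
    Q = (y[:, None] * y[None, :]) * K
    c = np.zeros(N)
    for _ in range(iters):
        # gradient of the dual objective is 1 - Q c; clip enforces the box constraint
        c = np.clip(c + lr * (1.0 - Q @ c), 0.0, C)
    return c

# Tiny 1-D example: two well-separated clusters
X = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
c = fit_dual(rbf(X, X), y)

def predict(Xtest):
    return np.sign(rbf(Xtest, X) @ (c * y))   # rule (4) with b = 0

print(predict(X))   # classifies the training points correctly
```

Dropping b is a common simplification for kernels with a constant-offset component; the full dual with the equality constraint would instead be handled by an off-the-shelf QP or SMO solver.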
These consist of pairs (x_i, Σ_i) representing the mean and covariance of a normally distributed random vector ξ_i. Despite this simplicity, Gaussian points are sufficiently general to cover many real applications, since most physical measurements involve random noise that follows a Gaussian distribution. Alternatively, we may think of the data as points on a manifold of distributions (though only Gaussian ones at present). These may represent approximations of empirical distributions that have been obtained at various times. An indicator y_i is provided as usual, along with the information on the category that the current datum belongs to. If we regard the datum as representing a distribution, we postulate that it arose from an experiment involving population y_i.

We follow the standard setting of kernel Support Vector Machines (kSVMs) that was outlined in the background section, on which we overlay a probabilistic component. This Probabilistic kernel Support Vector Machine (PkSVM) relies on a suitable modification of the kernel. To this end we consider the set of "data points"

    Ω := {(x, Σ) | x ∈ R^n, Σ ∈ S_{+,n}},

with S_{+,n} the cone of non-negative definite symmetric matrices in R^{n×n}. We also consider the family of normally distributed random vectors ξ ∼ N(x, Σ) for (x, Σ) ∈ Ω. We utilize the popular exponential Radial Basis Function (RBF) kernel

    k(x, y) = e^{−‖x−y‖²/(2σ²)},   (5)

where the parameter σ > 0 affects the scale of the desired resolution. For any (x_i, Σ_i) ∈ Ω, (x_j, Σ_j) ∈ Ω, we define the random vectors

    ξ_i = x_i + Σ_i^{1/2} ε,   (6a)
    ξ_j = x_j + Σ_j^{1/2} ε,   (6b)

where ε is a zero-mean Gaussian random vector with unit covariance, i.e., ε ∼ N(0, I), shared by the two. Evidently, ξ_i ∼ N(x_i, Σ_i) and ξ_j ∼ N(x_j, Σ_j).
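The coupling in (6), the same draw of ε driving both random vectors, can be made concrete with a Monte Carlo estimate of E{k(ξ_i, ξ_j)}; the means, covariances, and sample count below are illustrative values, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
xi, Si = np.array([0.5, 0.0]), 0.04 * np.eye(2)    # illustrative Gaussian point i
xj, Sj = np.array([-0.5, 1.0]), 0.25 * np.eye(2)   # illustrative Gaussian point j

# One shared eps ~ N(0, I) per sample drives BOTH xi_s and xj_s, per (6a)-(6b)
eps = rng.standard_normal((200000, 2))
xi_s = xi + 0.2 * eps   # Sigma_i^{1/2} = 0.2 I for the diagonal Sigma_i above
xj_s = xj + 0.5 * eps   # Sigma_j^{1/2} = 0.5 I

mc = np.mean(np.exp(-np.sum((xi_s - xj_s) ** 2, axis=1) / (2 * sigma**2)))
# The closed form (7)-(9) below evaluates to about 0.3665 for these values,
# which the Monte Carlo average approaches.
```

With independent rather than shared draws of ε the expectation would be different; the shared draw is exactly what makes the closed-form evaluation in (7) a single Gaussian integral.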
We then define the kernel

    κ((x_i, Σ_i), (x_j, Σ_j)) = E{k(ξ_i, ξ_j)} = E{e^{−‖ξ_i−ξ_j‖²/(2σ²)}}
        = (2π)^{−n/2} ∫ e^{−(1/(2σ²))‖x_i−x_j+(Σ_i^{1/2}−Σ_j^{1/2})ε‖² − ‖ε‖²/2} dε
        = |I + U_{ij}²|^{−1/2} e^{−(1/(2σ²)) ‖x_i−x_j‖²_{(I+U_{ij}²)^{−1}}},   (7)

where ‖v‖²_M := v'Mv and

    U_{ij} = (Σ_i^{1/2} − Σ_j^{1/2})/σ.   (8)

We now state our main result.

Theorem 1: The function

    κ((x_i, Σ_i), (x_j, Σ_j)) = |I + U_{ij}²|^{−1/2} e^{−(1/(2σ²)) ‖x_i−x_j‖²_{(I+U_{ij}²)^{−1}}},   (9)

with U_{ij} as in (8), defines a positive kernel on Ω.

Proof: Consider k(x, y) in (5) for x, y ∈ R^n. Then, for any collection α_i ∈ R and any collection of (x_i, Σ_i) ∈ Ω, for i ∈ {1, ..., N},

    Σ_{i,j=1}^N α_i α_j κ((x_i, Σ_i), (x_j, Σ_j)) = E{Σ_{i,j=1}^N α_i α_j k(ξ_i, ξ_j)}
        = E{⟨Σ_{i=1}^N α_i φ(ξ_i), Σ_{j=1}^N α_j φ(ξ_j)⟩_H} ≥ 0,

with φ(·) the map to the radial basis functions corresponding to the positive kernel (5). The claim in the theorem follows. ∎

The kernel (9) partially encapsulates the topology of the manifold Ω. Indeed, when σ is relatively large compared with Σ_i, Σ_j, then κ((x_i, Σ_i), (x_j, Σ_j)) decreases as the difference between Σ_i and Σ_j increases. Thus, (x_i, Σ_i) and (x_j, Σ_j) move farther apart in the feature space.

With the probabilistic kernel κ(·, ·), we can then formulate the PkSVM over the space Ω of Gaussian data points. In particular, given N data points {(x_i, Σ_i), y_i}, the dual formulation of PkSVM reads

    max_{c_1,...,c_N} Σ_{i=1}^N c_i − (1/2) Σ_{i,j=1}^N y_i y_j c_i c_j κ((x_i, Σ_i), (x_j, Σ_j))   (10a)
    s.t. Σ_{i=1}^N c_i y_i = 0,   (10b)
         0 ≤ c_i ≤ (2Nλ)^{−1}, 1 ≤ i ≤ N.   (10c)

Once the coefficients {c_1, c_2, ..., c_N} are learned, for any point (x, Σ) ∈ Ω, the classification rule is

    x ↦ sign(Σ_{i=1}^N c_i y_i κ((x_i, Σ_i), (x, Σ)) − b),   (11)

where

    b = Σ_{i=1}^N c_i y_i κ((x_i, Σ_i), (x_j, Σ_j)) − y_j,   (12)

with j an index such that 0 < c_j < (2Nλ)^{−1}.
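The closed form (9) is straightforward to evaluate numerically. A minimal sketch in NumPy (function and variable names are ours; the symmetric PSD square root is computed by eigendecomposition):

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric PSD square root Sigma^{1/2} via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def pk_kernel(xi, Si, xj, Sj, sigma=1.0):
    """Kernel (9): |I + U^2|^{-1/2} exp(-d'(I+U^2)^{-1}d / (2 sigma^2)),
    with U = (Si^{1/2} - Sj^{1/2}) / sigma and d = xi - xj."""
    n = xi.shape[0]
    U = (psd_sqrt(Si) - psd_sqrt(Sj)) / sigma
    A = np.eye(n) + U @ U
    d = xi - xj
    quad = d @ np.linalg.solve(A, d)
    return np.exp(-quad / (2 * sigma**2)) / np.sqrt(np.linalg.det(A))

# Zero covariances recover the plain RBF kernel (5):
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
Z = np.zeros((2, 2))
print(np.isclose(pk_kernel(x, Z, y, Z), np.exp(-1.0)))   # ||x-y||^2 = 2, sigma = 1
```

Note that U_{ij} changes sign when i and j are swapped while U_{ij}² does not, so the function is symmetric in its two arguments, as a kernel must be.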
It can be seen that when applied to points in Ω having zero uncertainty, i.e., when the corresponding covariance matrices are identically 0, the Probabilistic kernel Support Vector Machine model reduces to the standard one where points lie in R^n. That is, κ((x_i, 0), (x_j, 0)) = k(x_i, x_j). Thereby, PkSVM is a natural extension of standard SVM, able to account for error and uncertainty that is available and encoded in the data.

IV. NUMERICAL EXAMPLE

In this section, we present a simple example to highlight the PkSVM framework we developed. Consider a synthetic dataset generated as follows. One cluster, labeled by y_i = 1, consists of 200 data points {x_i | 1 ≤ i ≤ 200} uniformly sampled over a 2D unit disk. The covariance for each of these samples is set to be

    Σ_i = Σ_L = diag(0.01, 0.01),

that is, they are of low uncertainty. Another cluster, labeled by y_i = −1, consists of 200 data points {x_i | 201 ≤ i ≤ 400} uniformly sampled over the annulus {x | 1 ≤ ‖x‖ ≤ 2}. Their covariances are set to be

    Σ_i = Σ_H = diag(0.09, 0.09),

that is, they have higher uncertainty. This dataset Z = {(x_i, Σ_i), y_i | 1 ≤ i ≤ 400} is depicted in Figure 1.

We train the PkSVM model (10) with the above dataset Z. The parameters are set to λ = 0.001, σ = 1. To illustrate the trained model, we apply it to predict the score of any test datum (x, Σ) ∈ Ω with x ∈ [−2, 2] × [−2, 2]. In particular, we fix Σ and predict the score of (x, Σ) over a grid. When Σ = Σ_L, the result is illustrated in Figure 2; when Σ = Σ_H, the result is shown in Figure 3. The three black curves are contours of the scores corresponding to the values −0.5, 0, 0.5, respectively. Recall that the score is always between −1 and 1, and a score closer to 1 (−1) indicates that the data point is more likely to have label 1 (−1). As can be observed from Figures 2–3, when Σ = Σ_H, the boundary contours tend to move inward, toward the disk.
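The experimental setup above can be sketched as follows; this is a hypothetical reconstruction (random seed and sampling code are ours), exploiting the fact that for isotropic covariances Σ = s²I the kernel (9) reduces to a scalar formula that vectorizes over the whole dataset:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, N = 1.0, 200

# Cluster +1: uniform on the unit disk, low uncertainty Sigma_L = 0.01 I
th = rng.uniform(0, 2 * np.pi, N)
r = np.sqrt(rng.uniform(0, 1, N))          # sqrt for uniform area density
X_pos = np.c_[r * np.cos(th), r * np.sin(th)]

# Cluster -1: uniform on the annulus 1 <= ||x|| <= 2, high uncertainty Sigma_H = 0.09 I
th = rng.uniform(0, 2 * np.pi, N)
r = np.sqrt(rng.uniform(1, 4, N))
X_neg = np.c_[r * np.cos(th), r * np.sin(th)]

X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(N), -np.ones(N)]
s = np.r_[np.full(N, 0.1), np.full(N, 0.3)]   # sqrt of the isotropic variances

# For isotropic Sigma = s^2 I, (9) reduces to (with u_ij = (s_i - s_j)/sigma):
#   kappa = (1 + u^2)^{-n/2} exp(-||x_i - x_j||^2 / (2 sigma^2 (1 + u^2)))
u2 = ((s[:, None] - s[None, :]) / sigma) ** 2
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = (1.0 + u2) ** -1.0 * np.exp(-sq / (2 * sigma**2 * (1.0 + u2)))   # n = 2

# G is the precomputed 400 x 400 Gram matrix for the dual (10); any standard
# SVM dual solver that accepts precomputed kernels can be used from here.
```

By Theorem 1 the resulting Gram matrix is positive semidefinite, with unit diagonal since d = 0 and U = 0 when a point is paired with itself.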
This is consistent with the intuition behind the PkSVM framework. Indeed, since Σ = Σ_H, with our kernel (9), a test datum on the boundary shows higher similarity to the points in the y_i = −1 cluster than to those in the y_i = 1 cluster; intuitively, the data point is closer to the cluster y_i = −1 in the feature space. Thus, data with higher uncertainty are more likely to be labeled −1. Similarly, the boundary contours tend to move outward from the disk when Σ = Σ_L; the rationale is exactly the same.

Fig. 1: The intensity of uncertainty of the data points is captured by the size of the circles. The red circles denote the covariances of the cluster with label 1; the blue circles denote the covariances of the cluster with label −1. The blue data points have higher uncertainty.

Fig. 2: Prediction contours of testing data with low uncertainty Σ_L.

Fig. 3: Prediction contours of testing data with high uncertainty Σ_H.

Fig. 4: Prediction contours of standard SVM.

For the sake of comparison, we also run a standard SVM (Figure 4) over the same dataset {x_i, y_i | 1 ≤ i ≤ 400} without the covariances {Σ_i}. Note that standard SVM cannot process the extra information introduced by the covariances.

We carried out the same procedure to train the PkSVM model with multiple other datasets. These are illustrated in Figures 5 and 6. In particular, the data points in Figure 5 are generated in the same way as above, except that the covariances of the red points are

    Σ_R = diag(0.09, 0.01),

and the covariances of the blue points are

    Σ_B = diag(0.01, 0.09).

As can be seen from Figure 5, when the uncertainties of the testing data are the same as those of the blue (red) training data, the prediction boundary tends to move inward (outward), which is consistent with the intuition behind the PkSVM. Similar observations can be drawn in Figure 6 for a different dataset.

Fig. 5: Prediction contours with different uncertainties: (a) Σ = Σ_B, (b) Σ = Σ_R.

Fig. 6: Prediction contours with different uncertainties: (a) Σ = Σ_B, (b) Σ = Σ_R.

V. CONCLUSION

The formalism herein presents a new paradigm where data incorporate a quantification of their own uncertainty. We focused on binary classification and the case of "Gaussian points" to present a proof of concept, in the form of a suitable kernel in Theorem 1. Numerical experiments are provided to illustrate our framework. The basic idea appears to be easily generalizable to more detailed and explicit descriptions of uncertainty, e.g., Gaussian Mixture models, with, however, the caveat of added complexity in the resulting formulae. This we expect will be the starting point of future investigations. Interestingly, the Gaussian-points data structure coincides with the diffusion tensor widely used in diffusion tensor imaging (DTI) [16], [17], an important technique in magnetic resonance imaging (MRI). Thus, another promising direction lies in the interface between our framework and DTI.

ACKNOWLEDGMENTS

This research was funded through NSF under grants 1901599, 1807664, 1839441, AFOSR under grant FA9550-20-1-0029, National Institutes of Health grant RF1 AG053991, and the Breast Cancer Research Foundation.

REFERENCES

[1] I. Steinwart and A. Christmann, Support Vector Machines. Springer Science & Business Media, 2008.
[2] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
[3] J. A. Suykens, "Support vector machines: a nonlinear modelling and control perspective," European Journal of Control, vol. 7, no. 2-3, pp. 311–327, 2001.
[4] S. Y. Kung, Kernel Methods and Machine Learning. Cambridge University Press, 2014.
[5] X. Wang and P. M. Pardalos, "A survey of support vector machines with uncertainties," Annals of Data Science, vol. 1, no. 3-4, pp. 293–309, 2014.
[6] J. Bi and T. Zhang, "Support vector classification with input data uncertainty," in Advances in Neural Information Processing Systems, 2005, pp. 161–168.
[7] T. B. Trafalis and R. C. Gilbert, "Robust support vector machines for classification and computational issues," Optimisation Methods and Software, vol. 22, no. 1, pp. 187–198, 2007.
[8] Z. Xie, Y. Xu, and Q. Hu, "Uncertain data classification with additive kernel support vector machine," Data & Knowledge Engineering, vol. 117, pp. 87–97, 2018.
[9] A. Feragen, F. Lauze, and S. Hauberg, "Geodesic exponential kernels: When curvature and linearity conflict," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3032–3042.
[10] B. Schölkopf, C. J. Burges, A. J. Smola et al., Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[11] T. M. Cover, "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," IEEE Transactions on Electronic Computers, no. 3, pp. 326–334, 1965.
[12] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, no. 3, pp. 337–404, 1950.
[13] D. Alpay, An Advanced Complex Analysis Problem Book: Topological Vector Spaces, Functional Analysis, and Hilbert Spaces of Analytic Functions. Birkhäuser Basel, 2015.
[14] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[15] Support-vector machine, 2020 (accessed January 28, 2020). [Online]. Available: https://en.wikipedia.org/wiki/Support-vector_machine
[16] D. Le Bihan, J.-F. Mangin, C. Poupon, C. A. Clark, S. Pappata, N. Molko, and H. Chabriat, "Diffusion tensor imaging: concepts and applications," Journal of Magnetic Resonance Imaging, vol. 13, no. 4, pp. 534–546, 2001.
[17] H. Farooq, Y. Chen, T. T. Georgiou, and C. Lenglet, "Some geometric ideas for feature enhancement of diffusion tensor fields," in 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016, pp. 3856–3861.