Explicit Approximations of the Gaussian Kernel

Andrew Cotter, Joseph Keshet and Nathan Srebro
{COTTER, JKESHET, NATI}@TTIC.EDU
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave., Chicago, Illinois 60637, USA

November 27, 2024

Abstract

We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite-dimensional polynomial feature representation based on the Taylor expansion of the exponential. Although not as efficient as the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features, we show how this polynomial representation can provide a better approximation in terms of the computational cost involved. This makes our "Taylor features" especially attractive for use on very large data sets, in conjunction with online or stochastic training.

1 Introduction

In recent years several extremely fast methods for training linear support vector machines have been developed. These are generally stochastic (online) methods, which work on one example at a time, and for which each step involves only simple calculations on a single feature vector: inner products and vector additions [Shalev-Shwartz et al., 2007, Hsieh et al., 2008]. Such methods are capable of training support vector machines (SVMs) with many millions of examples in a few seconds on a conventional CPU, essentially eliminating any concerns about training runtime even on very large datasets.

Meanwhile, fast methods for training kernelized SVMs have lagged behind. State-of-the-art kernel SVM training methods may take days or even weeks of conventional CPU time for problems with a million examples of effective dimension less than 100. While the stochastic methods mentioned above can indeed be kernelized, each iteration then requires the computation of an entire row of the kernel matrix, i.e. the entire data set needs to be considered in each stochastic step.
Any Mercer kernel implements an inner product between mappings of two input vectors into a high-dimensional feature space. In this paper we propose an explicit low-dimensional approximation to this mapping, which, after being applied to the input data, can be used with an efficient linear SVM solver. The dimension of the approximate mapping controls both the computational difficulty and the approximation quality; the key to choosing a good approximate mapping is trading off these considerations. Rahimi and Recht [2007] proposed such a feature representation for the Gaussian kernel (as well as other shift-invariant kernels) using random "Fourier" features: each feature (each coordinate in the feature mapping) is a cosine of a random affine projection of the data. In this paper we study an alternative simple feature representation approximating the Gaussian kernel: we take a low-order Taylor expansion of the exponential, resulting in features that are scaled monomials in the coordinates of the input vectors. We focus on the Gaussian kernel, but a similar approach could also work for other kernels which depend on distances or inner products between feature vectors, e.g. the sigmoid kernel.

At first glance it seems that this Taylor feature representation must be inferior to random Fourier features. The theoretical guarantee on the approximation quality is given by the error of a Taylor series, and is expressed most naturally in terms of the degree of the expansion, of which the number of features is an exponential function. Indeed, to achieve the same approximation quality, we need many more Taylor than random Fourier features (see Section 4 for a detailed analysis). Furthermore, the Taylor features are not shift and rotation invariant, even though the Gaussian kernel itself is of course shift and rotation invariant.
However, we argue that when choosing an explicit feature representation, one should focus not on the number of features used by the representation, but rather on the computational cost of computing it. In online (or stochastic) optimization, each example is considered only once, or perhaps a few times, and the cost to the SVM optimizer of each step is essentially just the cost of reading the feature vector. Even if each training example is considered several times, the dataset will often be sufficiently large that precomputing and saving all feature vectors is infeasible. For example, consider a data set of hundreds of millions of examples, and an explicit feature mapping with 100,000 features. Although it might be possible to store the input representation in memory, it would require tens of terabytes to store the feature vectors. Instead, one will need to re-compute each feature vector when required. The computational cost of training is then dominated by that of computing the features, and we should judge the utility of a feature mapping not by its approximation quality as a function of dimensionality, but rather as a function of computational cost.

We will discuss how the cost of computing the Taylor features can be dramatically less than that of the random Fourier features, especially for sparse input data. In fact, the advantage of the Taylor features over the random Fourier features for sparse data is directly related to the Taylor features not being rotationally and shift invariant, as these operations do not preserve sparsity. We demonstrate empirically that on many benchmark datasets, although the Taylor representation requires many more features to achieve the same approximation quality as random Fourier features, it nevertheless outperforms the random Fourier features in terms of approximation and prediction quality as a function of the computational cost.

Related Work

Fine and Scheinberg [2002] and Balcan et al.
[2006] suggest obtaining a low-dimensional approximation to an arbitrary kernel by approximating the empirical Gram matrix. Such approaches invariably involve calculating a factorization of (at least a large subset of) the Gram matrix, an operation well beyond reach for large data sets. Here, we use an efficient non-data-dependent approximation that relies on analytic properties of the Gaussian kernel. A similar approximation of the Gaussian kernel by a low-dimensional Taylor expansion was proposed by Yang et al. [2006], who used this approximation to speed up a conjugate gradient optimizer. Xu et al. [2004] also proposed the use of the Taylor expansion to explicitly approximate the Hilbert space induced by the Gaussian kernel, but presented neither experiments nor a quantitative discussion of the approximation. We are not aware of any comparison of the Fourier features with the Taylor features, beyond a passing mention by Rahimi and Recht [2008] that the number of Taylor features required for good approximation grows rapidly. In particular, we are not aware of a previous analysis taking into account the computational cost of generating the features, which is an important issue that, as we discuss here, changes the picture entirely.

2 Kernel Projections and Approximations

Consider a classifier based on a predictor $f : \mathcal{X} \to \mathbb{R}$, which is trained by minimizing the regularized training error on a training set of examples $S = \{(x_i, y_i)\}_{i=1}^m$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Here we take $\mathcal{Y} = \{\pm 1\}$ and minimize the hinge loss, although our approach holds for other loss functions, including multiclass and structured losses. The "kernel trick" is a popular strategy which permits using linear predictors in some implicit Hilbert space $\mathcal{H}$, i.e. predictors of the form $f(x) = \langle w, \phi(x) \rangle$, where $\|w\|_{\mathcal{H}}$ is regularized, and $\phi : \mathcal{X} \to \mathcal{H}$ is given implicitly in terms of a kernel function $K(x, x') = \langle \phi(x), \phi(x') \rangle$.
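To make the evaluation cost of such a kernelized predictor concrete, the following is a minimal NumPy sketch (the function names are ours, not from the paper) of evaluating a Gaussian-kernel expansion over $m$ training points, which costs $O(d \cdot m)$ per test point:

```python
import numpy as np

def gaussian_kernel_row(X, x, sigma):
    """One row of the kernel matrix: K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def kernel_predict(alpha, X, x, sigma):
    """f(x) = sum_i alpha_i K(x_i, x): requires O(d * m) operations per test point."""
    return float(alpha @ gaussian_kernel_row(X, x, sigma))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # m = 1000 training points in d = 10 dimensions
alpha = rng.normal(size=1000)     # expansion coefficients
x = rng.normal(size=10)
print(kernel_predict(alpha, X, x, sigma=2.0))
```

This per-prediction cost is exactly what the explicit feature mapping below is designed to avoid.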
The Representer Theorem guarantees that the predictor minimizing the regularized training error is of the form:
\[ f^*(x) = \sum_{i=1}^m \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^m \alpha_i K(x_i, x) \tag{1} \]
for some set of coefficients $\alpha_i \in \mathbb{R}$. It suffices, then, to search over the coefficients $\alpha_i \in \mathbb{R}$ when training. However, when the size of the training set, $m$, is very large, it can be very expensive to evaluate (1) for even a single $x$. For example, for a $d$-dimensional input space $\mathcal{X} = \mathbb{R}^d$, and with a kernel whose evaluation runtime is even just linear in $d$ (e.g. the Gaussian kernel, as well as most other simple kernels), evaluating (1) requires $O(d \cdot m)$ operations.

The goal of this paper is to study an explicit finite-dimensional approximation $\tilde\phi : \mathbb{R}^d \to \mathbb{R}^D$ to the mapping $\phi$, which alleviates the need to use the representation (1). We will then consider classifiers $\tilde f$ of the form $\tilde f(x) = \langle \tilde w, \tilde\phi(x) \rangle$, where $\tilde w \in \mathbb{R}^D$ is a weight vector which we represent explicitly. Evaluating $\tilde f(x)$ requires $O(D)$ operations, which is better than the representation (1) when $D \ll d \cdot m$.

One option for constructing such an approximation is to project the mapping $\phi$ onto a $D$-dimensional subspace of $\mathcal{H}$: $\tilde\phi(x) = P\phi(x)$. This raises the question of how one may most effectively reduce the dimensionality of the subspace within which we work, while minimizing the resulting approximation error. Our first result bounds the error which results from solving the SVM problem on a subspace of $\mathcal{H}$. Consider the kernel Support Vector Machine (SVM) optimization problem (using $(\cdot)_+$ to denote $\max\{0, \cdot\}$):
\[ \min_{w \in \mathcal{H}} \; p(w) = \frac{\lambda}{2}\|w\|^2_{\mathcal{H}} + \frac{1}{m}\sum_{i=1}^m \left(1 - y_i \langle w, \phi(x_i) \rangle\right)_+ \tag{2} \]
and denote by $\tilde p(\tilde w)$ the objective function which results from replacing the mapping $\phi$ with the approximate mapping $\tilde\phi$. Recall also that $K(x, x') = \langle \phi(x), \phi(x') \rangle$ and denote $\tilde K(x, x') = \langle \tilde\phi(x), \tilde\phi(x') \rangle$.

Theorem 1.
Let $p^* = \inf_w p(w)$ be the optimum value of (2). For any approximate mapping $\tilde\phi(x) = P\phi(x)$ defined by a projection $P$, let $\tilde p^* = \inf_{\tilde w} \tilde p(\tilde w)$ be the optimum value of the SVM with respect to this feature mapping. Then:
\[ p^* \le \tilde p^* \le p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^m \sqrt{K(x_i, x_i) - \tilde K(x_i, x_i)} \]

Note that since we also have $\|\tilde\phi(x)\| \le \|\phi(x)\|$, it is meaningful to compare the objective values of the two SVMs.

Proof. For any $w$, we have $p(Pw) \le \tilde p(w)$, since $\|Pw\|^2_{\mathcal{H}} \le \|w\|^2_{\mathcal{H}}$ while the loss terms are identical. This implies that $p^* \le \tilde p^*$. For the second part of the inequality, note that:
\[ \left| (1 - y_i \langle w, \phi(x_i) \rangle)_+ - (1 - y_i \langle w, P\phi(x_i) \rangle)_+ \right| \le \left| \langle w, \phi(x_i) - P\phi(x_i) \rangle \right| \le \|w\|_{\mathcal{H}} \|P^\perp \phi(x_i)\|_{\mathcal{H}} \]
which implies that $\tilde p(w) \le p(w) + \frac{1}{m}\|w\|_{\mathcal{H}} \sum_{i=1}^m \|P^\perp \phi(x_i)\|_{\mathcal{H}}$ for any $w$, and in particular for $w^*$, the optimum of $p(w)$. Combined with $\|w^*\|_{\mathcal{H}} \le 1/\sqrt{\lambda}$ [Shalev-Shwartz et al., 2007], this yields:
\[ \tilde p^* \le p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^m \|P^\perp \phi(x_i)\|_{\mathcal{H}} = p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^m \sqrt{K(x_i, x_i) - \tilde K(x_i, x_i)} \]

Theorem 1 suggests using a low-dimensional projection minimizing $\sum_{i=1}^m \sqrt{K(x_i, x_i) - \tilde K(x_i, x_i)} = \sum_{i=1}^m \|\phi(x_i) - P\phi(x_i)\|$. That is, one should choose a subspace of $\mathcal{H}$ with small average distances to the data (not squared distances, as in PCA). The Taylor approximation we suggest is such a projection, albeit not the optimal one, so we can apply Theorem 1 to analyze its approximation properties.

Approximating with Random Features

A different option for approximating the mapping $\phi$ for a radial kernel of the form $K(x, x') = K(x - x')$ was proposed by Rahimi and Recht [2007]. They proposed mapping the input data to a randomized low-dimensional feature space as follows.
Let $\hat K(\omega)$ be the real-valued Fourier transform of the kernel $K(x - x')$, namely
\[ K(x - x') = \int_{\mathbb{R}^d} \hat K(\omega) \cos\left(\omega \cdot (x - x')\right) d\omega \tag{3} \]
Bochner's theorem ensures that if $K(x - x')$ is properly scaled, then $\hat K(\omega)$ is a proper probability distribution. Hence:
\[ K(x - x') = \mathbb{E}_{\omega \sim \hat K}\left[\cos\left(\omega \cdot (x - x')\right)\right] = \mathbb{E}_{\omega \sim \hat K,\, \theta}\left[2\cos(\omega \cdot x + \theta)\cos(\omega \cdot x' + \theta)\right]. \]
The kernel function can then be approximated by independently drawing $\omega_1, \ldots, \omega_D \in \mathbb{R}^d$ from the distribution $\hat K(\omega)$ and $\theta_1, \ldots, \theta_D$ uniformly from $[0, 2\pi]$, and using the explicit feature mapping:
\[ \tilde\phi_j(x) = \sqrt{2/D}\, \cos(\omega_j \cdot x + \theta_j) \tag{4} \]
(the factor $\sqrt{2/D}$ makes $\langle \tilde\phi(x), \tilde\phi(x') \rangle$ an unbiased estimate of the kernel). In the case of the Gaussian kernel, $K(x - x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$, and $\hat K(\omega) \propto \exp\left(-\sigma^2\|\omega\|^2 / 2\right)$ defines a Gaussian distribution, $\omega \sim \mathcal{N}(0, \sigma^{-2}I)$, from which it is easy to draw i.i.d. samples. The following guarantee was provided on the convergence of the kernel values $\tilde K(x, x') = \langle \tilde\phi(x), \tilde\phi(x') \rangle$ corresponding to the random Fourier feature mapping:

Claim 1 (Rahimi and Recht, Claim 1). Let $\tilde K$ be the kernel defined by $D$ random Fourier features, and let $R$ be the radius (in the input space) of the training set. Then, for any $\varepsilon > 0$:
\[ \Pr\left[\sup_{\|x\|, \|y\| \le R} \left|K(x, y) - \tilde K(x, y)\right| \ge \varepsilon\right] \le 2^8 \frac{d}{\varepsilon^2}\left(\frac{R}{\sigma}\right)^2 e^{-\frac{D\varepsilon^2}{4(2+d)}} \tag{5} \]

It is also worth mentioning that the random Fourier features are invariant to translations and rotations, as is the kernel itself. However, because each feature corresponds to an independent random projection, a collection of such features will not, in general, form an orthogonal projection, implying that Theorem 1 does not apply.

3 Taylor Features

In this section we present an alternative approximation of the Gaussian kernel, which is obtained by a projection onto a subspace of $\mathcal{H}$.
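The random Fourier construction above can be sketched in a few lines of NumPy (the function names are ours; the $\sqrt{2/D}$ normalization makes the inner product an unbiased Monte-Carlo estimate of the kernel):

```python
import numpy as np

def fourier_features(X, D, sigma, seed=0):
    """Map rows of X to D random Fourier features sqrt(2/D) * cos(omega.x + theta),
    with omega ~ N(0, I/sigma^2) and theta ~ Uniform[0, 2*pi] (Rahimi & Recht)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(d, D))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ omega + theta)

# <phi(x), phi(y)> concentrates around the Gaussian kernel value as D grows
rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
sigma = 2.0
K = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
Z = fourier_features(np.vstack([x, y]), D=20000, sigma=sigma)
print(K, Z[0] @ Z[1])   # the two values should be close
```

Note that, per the cost analysis in Section 4, each of the $D$ features here requires a full $d$-dimensional inner product.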
The idea is to use the Taylor series expansion of the Gaussian kernel function with respect to $\langle x, x' \rangle$, where each term in the Taylor series can then be expressed as a sum of matching monomials in $x$ and $x'$. More specifically, we express the Gaussian kernel as:
\[ K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}} = e^{-\frac{\|x\|^2}{2\sigma^2}} e^{-\frac{\|x'\|^2}{2\sigma^2}} e^{\frac{\langle x, x' \rangle}{\sigma^2}} \tag{6} \]
The first two factors depend on $x$ and $x'$ separately, so we focus on the third factor. The term $z = \langle x, x' \rangle / \sigma^2$ is a real number, and using the (scalar) Taylor expansion of $e^z$ around $z = 0$ we have:
\[ e^{\frac{\langle x, x' \rangle}{\sigma^2}} = \sum_{k=0}^\infty \frac{1}{k!}\left(\frac{\langle x, x' \rangle}{\sigma^2}\right)^k \tag{7} \]
We now expand:
\[ \langle x, x' \rangle^k = \left(\sum_{i=1}^d x_i x'_i\right)^k = \sum_{j \in [d]^k} \left(\prod_{i=1}^k x_{j_i}\right)\left(\prod_{i=1}^k x'_{j_i}\right) \tag{8} \]
where $j$ enumerates over all selections of $k$ coordinates of $x$ (for simplicity of presentation, we allow repetitions and enumerate over different orderings of the same coordinates, thus avoiding explicitly writing down the multinomial coefficients). We can think of (8) as an inner product between degree-$k$ monomials of the coordinates of $x$ and $x'$. Plugging this back into (7) and (6) yields the following explicit feature representation for the Gaussian kernel:
\[ \phi_{k,j}(x) = e^{-\frac{\|x\|^2}{2\sigma^2}} \frac{1}{\sigma^k \sqrt{k!}} \prod_{i=1}^k x_{j_i} \tag{9} \]
with $K(x, x') = \langle \phi(x), \phi(x') \rangle = \sum_{k=0}^\infty \sum_{j \in [d]^k} \phi_{k,j}(x) \phi_{k,j}(x')$.

Now, for our approximate feature space, we project onto the coordinates of $\phi(\cdot)$ corresponding to $k \le r$, for some degree $r$. That is, we take $\tilde\phi_{k,j}(x) = \phi_{k,j}(x)$ for $k \le r$. This corresponds to truncating the Taylor expansion (7) after the $r$th term. We would like to bound the error introduced by this approximation, i.e. bound $|K(x, x') - \tilde K(x, x')|$, where:
\[ \tilde K(x, x') = \langle \tilde\phi(x), \tilde\phi(x') \rangle = e^{-\frac{\|x\|^2 + \|x'\|^2}{2\sigma^2}} \sum_{k=0}^r \frac{1}{k!}\left(\frac{\langle x, x' \rangle}{\sigma^2}\right)^k \tag{10} \]
The difference $|K(x, x') - \tilde K(x, x')|$ is given (up to scaling by the leading factor) by the higher-order terms of the Taylor expansion of $e^z$, which by Taylor's theorem are bounded by $\frac{z^{r+1}}{(r+1)!} e^\alpha$ for some $|\alpha| \le |z|$. We may bound $|\alpha| \le |\langle x, x' \rangle| / \sigma^2$ and $|\langle x, x' \rangle| \le \|x\|\|x'\|$, obtaining:
\[ \left|K(x, x') - \tilde K(x, x')\right| \le e^{-\frac{\|x\|^2 + \|x'\|^2}{2\sigma^2}} \frac{1}{(r+1)!}\left(\frac{\|x\|\|x'\|}{\sigma^2}\right)^{r+1} e^{\frac{\|x\|\|x'\|}{\sigma^2}} \le \frac{1}{(r+1)!}\left(\frac{\|x\|\|x'\|}{\sigma^2}\right)^{r+1} \tag{11} \]

As for the dimensionality $D$ of $\tilde\phi(\cdot)$ (i.e. the number of features of degree at most $r$): as presented, we have $d^k$ features of degree $k$. But this ignores the fact that many features are just duplicates resulting from different permutations of $j$. Collecting these into a single feature for each distinct monomial (with the appropriate multinomial coefficient), we have $\binom{d+k-1}{k}$ features of degree $k$, and a total of $D = \binom{d+r}{r}$ features of degree at most $r$.

4 Theoretical Comparison of Taylor and Random Fourier Features

We now compare the error bound of the Taylor features given in (11) to the probabilistic bound of the random Fourier features given in (5). We first note that each Taylor feature may be calculated in constant time, because each degree-$k$ feature may be derived from a degree-$(k-1)$ feature by multiplying it by a constant times an element of $x$. In fact, because each feature is proportional to a product of elements of $x$, on sparse datasets the Taylor features will themselves be highly sparse, enabling one to entirely avoid calculating many features. For a vector $x$ with $\tilde d$ nonzeros, one may verify that there will be $\binom{\tilde d + r}{r} = O(\tilde d^r)$ nonzero features of degree at most $r$, which can all be computed in overall time $O(\tilde d^r)$.
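The truncated kernel (10) and the remainder bound (11) are easy to check numerically. Here is a minimal sketch (function names ours) that evaluates the degree-$r$ approximation and verifies it against the bound:

```python
import math
import numpy as np

def taylor_kernel(x, y, sigma, r):
    """Degree-r truncation (10): exp(-(|x|^2 + |y|^2) / 2 sigma^2) * sum_{k<=r} z^k / k!,
    where z = <x, y> / sigma^2."""
    z = float(np.dot(x, y)) / sigma ** 2
    scale = math.exp(-(np.dot(x, x) + np.dot(y, y)) / (2.0 * sigma ** 2))
    return scale * sum(z ** k / math.factorial(k) for k in range(r + 1))

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)
sigma, r = 3.0, 4
K = math.exp(-float(np.sum((x - y) ** 2)) / (2.0 * sigma ** 2))
err = abs(K - taylor_kernel(x, y, sigma, r))
# Taylor-remainder bound (11): (|x| |y| / sigma^2)^(r+1) / (r+1)!
bound = (np.linalg.norm(x) * np.linalg.norm(y) / sigma ** 2) ** (r + 1) / math.factorial(r + 1)
assert err <= bound
```

In a real implementation one would of course materialize the individual features (9), not just the truncated kernel value; the sketch only demonstrates the approximation quality.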
In contrast, computing each Fourier feature requires $O(\tilde d)$ time on a vector with $\tilde d$ nonzeros, yielding an overall time of $O(D \cdot \tilde d)$ to compute $D$ random Fourier features.

With this in mind, we will define $B$ as a "budget" of operations, and will take as many features as may be computed within this budget, assuming that each nonzero Taylor feature may be calculated in one operation, and each Fourier feature in $\tilde d$. Setting $\delta = \Pr\left[|K(x, y) - \tilde K(x, y)| \ge \varepsilon\right]$ and solving (5) for $\varepsilon$, with $D \approx B/\tilde d$, yields that with probability $1 - \delta$, for the Fourier features:
\[ \left|K(x, x') - \tilde K(x, x')\right| \approx O\left(\sqrt{\frac{\tilde d\, d}{B} \log \frac{d R^2}{\delta \sigma^2}}\right) \tag{12} \]
For the Taylor features, $B = \binom{\tilde d + r}{r}$ implies that $r + 1 \gtrsim \frac{\log B}{\log \tilde d}$. Applying Stirling's approximation to (11) yields:
\[ \left|K(x, x') - \tilde K(x, x')\right| \approx O\left(\sqrt{\frac{\log \tilde d}{\log B}}\; B^{-\frac{1}{\log \tilde d} \log \frac{\sigma^2 \log B}{R^2 \log \tilde d}}\right) \tag{13} \]

Neither of the above bounds clearly dominates the other. The main advantage of the Taylor approximation, also seen in the above bounds, is that its performance depends only on the number of non-zero input dimensions $\tilde d$, unlike the random Fourier features, whose cost scales quadratically with the dimension and, even for sparse data, depends (linearly) on the overall dimensionality. The computational budget required for the Taylor approximation is polynomial in the number of non-zero dimensions, but exponential in the effective radius $R/\sigma$.

Table 1: Datasets used in our experiments. The "Dim" and "NZ" columns contain the total number of elements in each training/testing vector, and the average number of nonzero elements, respectively.

    Name    Train size   Test size   Dim   NZ     Kernel SVM: C   σ²      Test error   Linear SVM: C   Test error
    Adult   32562        16282       123   13.9   1               40      14.9%        8               15.0%
    Cov1    522911       58101       54    12     3               0.125   6.2%         4               22.7%
    MNIST   60000        10000       768   150    1               100     0.57%        2               5.2%
    TIMIT   63881        22257       39    39     1               80      11.5%        4               22.7%
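As a back-of-the-envelope illustration of the budget comparison (the numbers below are ours, using $\tilde d = 14$ nonzeros, roughly the Adult dataset, and degree $r = 3$): the budget that suffices for all nonzero Taylor features buys far fewer Fourier features, since each Fourier feature costs $\tilde d$ operations.

```python
from math import comb

def taylor_budget(d_nz, r):
    """Operations (= nonzero features) to compute all Taylor features of degree <= r
    for an input with d_nz nonzero coordinates: C(d_nz + r, r)."""
    return comb(d_nz + r, r)

def fourier_count(budget, d_nz):
    """Random Fourier features computable within `budget`, at d_nz operations each."""
    return budget // d_nz

d_nz, r = 14, 3
B = taylor_budget(d_nz, r)
print(B, fourier_count(B, d_nz))   # 680 Taylor features vs. only 48 Fourier features
```

Whether those 680 heavily down-weighted monomial features approximate the kernel better than 48 Fourier features is exactly the trade-off that bounds (12) and (13) quantify.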
Once the budget is high enough, however, these features can yield a polynomial decrease in the approximation error. This suggests that the Taylor approximation is particularly appropriate for sparse (potentially high-dimensional) data sets with a moderate number of non-zeros, and where the kernel bandwidth is on the same order as the radius of the data (as is often the case).

5 Experiments

In this section, we describe an empirical comparison of the random Fourier features of Rahimi and Recht and the polynomial Taylor-based features described in Section 3. The question we ask is: which explicit feature representation provides a better approximation to the Gaussian kernel with a fixed computational budget?

Experiments were performed on four datasets, summarized in Table 1. Adult and MNIST were downloaded from Léon Bottou's LaSVM web page. They, along with Blackard and Dean's forest covertype-1 dataset (available in the UCI machine learning repository), are well-known SVM benchmark datasets. TIMIT is a phonetically transcribed speech corpus, of which we use a subset for framewise classification of the stop consonants. From each 10 ms frame of speech, we extracted MFCC features and their first and second derivatives. Both MNIST and TIMIT are multiclass classification problems, which we converted into binary problems by performing one-versus-rest, with the digit 8 and the phoneme /k/ being the positive classes, respectively. The regularization and Gaussian kernel parameters for Adult, MNIST and Cov1 are taken from Shalev-Shwartz et al. [2010], and are in turn based on those of Platt [1998] and Bordes et al. [2005]. The parameters for TIMIT were found by optimizing the test error on a held-out validation set. Of these datasets, all except MNIST are fairly low-dimensional, and all except TIMIT are sparse.
To get a rough sense of the benefit of the Gaussian kernel on these data sets, we also include in Table 1 the best test error obtained using a linear kernel over $C$s in the range $[2^{-6}, 2^6]$.

[Figure 1: Primal objectives and testing classification errors for various numbers of Fourier and Taylor features, on Adult, MNIST and TIMIT. For the Fourier features, the markers correspond to numbers of features which are powers of two, starting at 32. For Taylor, each marker corresponds to a degree, starting at 2. The cost of calculating $\tilde\phi$, in units of floating-point operations, is displayed on the horizontal axis. The solid black lines are the primal objective function value and testing classification error achieved by the optimal solution to the Gaussian kernel SVM problem, while the dashed lines in the bottom plots are the testing classification error achieved by a linear SVM.]

For each of the data sets, we compared the value of the (primal) SVM objective and the classification performance (on the test set) achieved using varying numbers of Taylor and Fourier features. Results are reported in Figure 1 and in the left column of Figure 2. We report results in units of the number of floating-point operations required to calculate each feature vector, taking into account sparsity, as discussed in Section 4. As discussed earlier, this is the dominant cost in stochastic optimization methods such as Pegasos [Shalev-Shwartz et al., 2007] and stochastic dual coordinate ascent [Hsieh et al., 2008], which are among the fastest methods for training large-scale linear SVMs.
We used a fairly optimized SGD implementation into which the explicit feature vector calculations were integrated. Our actual runtimes are indeed in line with the calculated computational costs (we prefer reporting the theoretical cost, as it is not implementation- or architecture-dependent). As can be seen from Figures 1 and 2, the computational cost required to obtain the same SVM performance is typically lower when using the Taylor features than the Fourier features, despite the exponential growth of the number of features as a function of the degree. The exception is the MNIST dataset, which has a fairly high number (over 150) of non-zero dimensions per data point, yielding an extremely sharp increase in the computational cost of higher-degree Taylor feature expansions.

To better appreciate the difference between the dependence on the number of features and that on the computational cost, we include more detailed results for the Cov1 dataset in Figure 2. Here we again plot the value of the SVM objective, this time both as a function of the number of features and as a function of the computational cost. As expected, the Fourier features perform much better as a function of the number of features, but, as argued earlier, we should be more concerned with the cost of calculating them. In order to directly measure how well each feature representation approximates the Gaussian kernel, we also include in Figure 2 a comparison of the average approximation error.

[Figure 2: Left column: same as Figure 1, for Cov1. Top middle: primal objective as a function of the total number of features (in log scale). Bottom middle: test error as a function of the Gaussian kernel parameter σ, for Taylor and random Fourier expansions of the same computational cost, compared with the true Gaussian kernel; the dot-dashed line indicates the performance of a 1-NN classifier, trained using the ANN library [Arya and Mount, 1993, Mount and Arya, 2006, Bagon, 2009]. Right column: average value of the approximation error $|K(x, x') - \tilde K(x, x')|$ over 100000 randomly chosen pairs of training vectors, in terms of both computational cost and total number of features.]

Next, we consider the effect of the bandwidth parameter σ on the Taylor and Fourier approximations; note that the theoretical analysis for both methods deteriorates significantly when the bandwidth decreases. This is verified in the bottom-middle plot of Figure 2, which shows that the (test) classification error of the two approximations (with the same fixed computational budget) deteriorates, relative to that of the true Gaussian SVM, as the bandwidth decreases. This deterioration can be observed on other data sets as well. On Cov1, the deterioration is so strong that even though the generalization performance of the true Gaussian kernel SVM keeps improving as the bandwidth decreases, the test errors for both approximations actually start increasing fairly quickly. It should be noted that Cov1 is atypical in this regard: nearest-neighbor classification achieves almost optimal results on this dataset (the dot-dashed line in the bottom-middle plot of Figure 2), and so decreasing the bandwidth, which approximates the nearest-neighbor classifier, is beneficial.
In contrast, on the data sets in Figure 1, the optimal bandwidth for the Gaussian kernel is large enough to allow good approximation by the Fourier and Taylor features.

Finally, in order to get some perspective on the real-world benefit of the Taylor features, we also report actual runtimes for a large-scale realistic example. We compared training times for the Gaussian kernel and the Taylor features on the full TIMIT dataset, where the goal was framewise phoneme classification: given a 10 ms frame of speech, the goal is to predict the uttered phoneme from a set of 39 phoneme symbols. We used the standard split of the dataset into training, validation and test sets, and extracted MFCC features. With this set of acoustic features, the common practice is to use the Gaussian kernel. Its bandwidth was selected on the validation set to be σ² = 19. The training set includes 1.1 million examples, and existing SVM libraries such as SVMLIB or SVMLight failed to converge in a reasonable amount of time (see the training times in Salomon et al. [2002]).

Table 2: Comparison of the Gaussian kernel and its Taylor approximation to the polynomial kernel $K(x, y) = (\langle x, y \rangle + 1)^d$, after scaling the data to have unit average squared norm. Here, $d$ is the degree of the polynomial. The reported test errors are the minima over parameter choices taken from a coarse power-of-two based grid, within which the reported parameters are well inside the interior. Gaussian kernel SVMs were optimized using our GPU optimizer, while the others were optimized by running Pegasos for 100 epochs.

    Dataset   Gaussian: C   σ²        Test error   Taylor: degree   C      σ²    Test error   Polynomial: degree   C     Test error
    Adult     4             100       14.9%        4                8      200   14.7%        4                    4     14.8%
    MNIST     8             100       0.42%        2                2048   200   0.54%        2                    256   0.58%
    TIMIT     2             40        10.8%        3                8      200   11.4%        3                    64    11.6%
    Cov1      16            0.03125   3.3%         4                128    0.5   12.3%        4                    512   13.6%
Using our own implementation with the exact Gaussian kernel and stochastic dual coordinate ascent, training took 313 hours (almost two weeks) on a 2 GHz Intel Core 2 (using one core). Using the same implementation with the kernel function replaced by its degree-3 Taylor approximation, training took only 53 hours. The results were almost the same: multiclass accuracy of 69.6% for the approximated kernel and 69.8% for the Gaussian kernel. These are state-of-the-art results for this dataset [Salomon et al., 2002, Graves and Schmidhuber, 2005].

6 Relationship to the Polynomial Kernel

Like the Taylor feature representation of the Gaussian kernel, the standard polynomial kernel of degree $r$:
\[ K(x, x') = \left(\langle x, x' \rangle + c\right)^r \tag{14} \]
corresponds to a feature space containing all monomials of degree at most $r$. More specifically, the features corresponding to the kernel (14) can be written as:
\[ \phi_{k,j}(x) = \sqrt{\binom{r}{k} c^{r-k}} \prod_{i=1}^k x_{j_i} \tag{15} \]
where, as in (9), $k = 0, \ldots, r$ and $j \in [d]^k$ enumerates over all selections of $k$ coordinates of $x$. The difference, relative to the Taylor approximation of the Gaussian, is only in a per-example overall scaling factor based on $\|x\|$, and in a different per-degree factor (which depends only on the degree $k$). This weighting by a degree-dependent factor should not be taken lightly, as it affects regularization, which is key to SVM training: features scaled by a larger factor are "cheaper" to use, compared to those scaled by a very small factor. Comparing the degree-dependent scaling in the two feature representations ((9) and (15)), we observe that the higher degrees are scaled by a much smaller factor in the Taylor features, owing to the rapidly decreasing dependence on $1/\sqrt{k!}$. This means that higher-degree monomials are comparatively much more expensive to use in the Taylor features, and that the learned predictor likely relies more on lower-degree monomials.
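The contrast between the per-degree factors $1/(\sigma^k \sqrt{k!})$ in (9) and $\sqrt{\binom{r}{k} c^{r-k}}$ in (15) can be tabulated directly; the parameter values below are illustrative choices of ours, not from the paper:

```python
import math

sigma, c, r = 1.0, 1.0, 6
# Per-degree scaling of the Taylor features (9), ignoring the exp(-|x|^2/2s^2) factor
taylor = [1.0 / (sigma ** k * math.sqrt(math.factorial(k))) for k in range(r + 1)]
# Per-degree scaling of the polynomial-kernel features (15)
poly = [math.sqrt(math.comb(r, k) * c ** (r - k)) for k in range(r + 1)]
for k in range(r + 1):
    print(k, round(taylor[k], 4), round(poly[k], 4))
# The Taylor factors fall off like 1/sqrt(k!), heavily down-weighting high-degree
# monomials, while the binomial factors peak at intermediate degrees.
```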
Nevertheless, the space of allowed predictors is nearly the same with both types of features, raising the question of how strong the actual effect of the different per-degree weighting is. The fact that all of the features in the Taylor representation are scaled by a factor depending on $\|x\|$ should make little difference on many datasets, as it affects all of the features of a given example equally. Likewise, if most of the used features are of the same degree, then we could perhaps correct for the degree-based scaling by changing the regularization parameter. The problem, of course, is searching for and selecting this parameter.

We checked whether we could find a substantial difference in performance between the Taylor and standard polynomial features. Because the dependence on the regularization parameter necessitated a search over the parameter space, we conducted a rough experiment in which we tried different parameters, and compared the best error achieved on the test set using the true Gaussian kernel, a Taylor approximation, and a standard polynomial kernel of the same degree. The results are summarized in Table 2. These experiments indicate that the standard polynomial features might be sufficient for approximating the Gaussian. Still, the Taylor features are just as easy to compute and use, and have the advantage that they use the same parameters as the Gaussian kernel. Hence, if we already have a sense of good bandwidth and $C$ parameters for the Gaussian kernel, we can use the same values for the Taylor approximation.

7 Summary

The use of explicit monomial features of the form of (15) has been discussed recently as a way of speeding up training with the polynomial kernel [Sonnenburg and Franc, 2010, Chang et al., 2010]. Our analysis and experiments indicate that a similar monomial representation is also suitable for approximating the Gaussian kernel.
We argue that such features might often be preferable to the random Fourier features recently suggested by Rahimi and Recht [2007]. This is especially true on sparse datasets with a moderate number (up to several dozen) of non-zero dimensions per data point.

Although we have only focused on binary classification, it is important to note that this explicit feature representation can be used anywhere else \ell_2 regularization is used. This includes multiclass, structured and latent SVMs. The use of such feature expansions might be particularly beneficial to structured SVMs, since these problems are hard to solve with only a kernel representation.

References

S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proc. SODA 1993, pages 271–280, 1993.

S. Bagon. Matlab class for ANN, February 2009. URL http://www.wisdom.weizmann.ac.il/~bagon/matlab.html.

M.-F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. In Machine Learning, volume 65 (1), pages 79–94. Springer Netherlands, October 2006.

A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. JMLR, 6:1579–1619, September 2005.

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. JMLR, 99:1471–1490, August 2010.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. JMLR, 2:243–264, March 2002.

A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602–610, 2005.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proc. ICML 2008, pages 408–415, 2008.

D. M. Mount and S. Arya. ANN: A library for approximate nearest neighbor searching, August 2006.
URL http://www.cs.umd.edu/~mount/ANN.

J. C. Platt. Fast training of support vector machines using Sequential Minimal Optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proc. NIPS 2007, 2007.

A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In Proceedings of the 46th Annual Allerton Conference, 2008.

J. Salomon, S. King, and M. Osborne. Framewise phone classification using support vector machines. In Proceedings of the Seventh International Conference on Spoken Language Processing, pages 2645–2648, 2002.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proc. ICML 2007, pages 807–814, 2007.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Mathematical Programming, pages 1–34. Springer, October 2010.

S. Sonnenburg and V. Franc. COFFIN: a computational framework for linear SVMs. In Proc. ICML 2010, 2010.

J.-W. Xu, P. P. Pokharel, K.-H. Jeong, and J. C. Principe. An explicit construction of a reproducing Gaussian kernel Hilbert space. In Proc. NIPS 2004, 2004.

C. Yang, R. Duraiswami, and L. Davis. Efficient kernel machine using the improved fast Gaussian transform. In Proc. IEEE International Conference on Acoustic, Speech and Signal Processing, 2006.