Linearized GMM Kernels and Normalized Random Fourier Features
The method of "random Fourier features (RFF)" has become a popular tool for approximating the "radial basis function (RBF)" kernel. The variance of RFF is actually large. Interestingly, the variance can be substantially reduced by a simple normalizat…
Authors: Ping Li
Department of Statistics and Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu

Abstract

The method of "random Fourier features (RFF)" has become a popular tool for approximating the "radial basis function (RBF)" kernel. The variance of RFF is actually large. Interestingly, the variance can be substantially reduced by a simple normalization step, as we theoretically demonstrate. We name the improved scheme the "normalized RFF (NRFF)", and we provide a technical proof of the theoretical variance of NRFF, as validated by simulations. We also propose the "generalized min-max (GMM)" kernel as a measure of data similarity, where data vectors can have both positive and negative entries. GMM is positive definite, and there is an associated hashing method named "generalized consistent weighted sampling (GCWS)" which linearizes this nonlinear kernel. We provide an extensive empirical evaluation of the RBF kernel and the GMM kernel on more than 50 publicly available datasets. For a majority of the datasets in our experiments, the (tuning-free) GMM kernel outperforms the best-tuned RBF kernel. We conduct extensive experiments comparing the linearized RBF kernel using NRFF hashing with the linearized GMM kernel using GCWS hashing. We observe that, to reach a comparable classification accuracy, GCWS typically requires substantially fewer samples than NRFF, even on datasets where the original RBF kernel outperforms the original GMM kernel. As the costs of training, storage, transmission, and processing are proportional to the sample size, our experiments demonstrate that GCWS would be a more practical scheme for large-scale learning. The empirical success of GCWS (compared to NRFF) can also be explained from a theoretical perspective.
Firstly, the relative variance (normalized by the squared expectation) of GCWS is substantially smaller than that of NRFF, except in the very high similarity region (where the variances of both methods are close to zero). Secondly, if we make a gentle model assumption on the data, we can show analytically that GCWS exhibits much smaller variance than NRFF for estimating the same object (e.g., the RBF kernel), except in the very high similarity region.

Inspired by this work, [15] developed "tunable GMM kernels" which on many datasets considerably improve the (tuning-free) GMM kernel. In fact, kernel SVMs with tunable GMM kernels can be strong competitors to deep nets and boosted trees. [14] compared GMM with the normalized GMM kernel and the intersection kernel. [13] reported experiments for linearizing GMM with the Nystrom method. [19] developed a theoretical framework for analyzing the convergence property of the GMM kernel using classical statistics, by making model assumptions. We expect that GMM and GCWS (and their variants) will be adopted in practice for large-scale statistical learning and efficient near neighbor search (as GCWS generates discrete hash values).

1 Introduction

It is popular in machine learning practice to use linear algorithms such as logistic regression or linear SVM. It is known that one can often improve the performance of linear methods by using nonlinear algorithms such as kernel SVMs, if the computational/storage burden can be resolved. In this paper, we introduce an effective measure of data similarity termed the "generalized min-max (GMM)" kernel and the associated hashing method named "generalized consistent weighted sampling (GCWS)", which efficiently converts this nonlinear kernel into a linear kernel. Moreover, we will also introduce what we call "normalized random Fourier features (NRFF)" and compare it with GCWS.
We start the introduction with the basic linear kernel. Consider two data vectors u, v ∈ R^D. It is common to use the normalized linear kernel (i.e., the correlation):

\rho = \rho(u, v) = \frac{\sum_{i=1}^{D} u_i v_i}{\sqrt{\sum_{i=1}^{D} u_i^2}\, \sqrt{\sum_{i=1}^{D} v_i^2}}    (1)

This normalization step is in general a recommended practice. For example, when using the LIBLINEAR or LIBSVM packages [6], it is often suggested to first normalize the input data vectors to unit l2 norm. In addition to packages such as LIBLINEAR which implement batch linear algorithms, methods based on stochastic gradient descent (SGD) become increasingly important, especially for truly large-scale industrial applications [2].

In this paper, the proposed GMM kernel is defined on general data types which can have both negative and positive entries. The basic idea is to first transform the original data into nonnegative data and then compute the min-max kernel [20, 9, 12] on the transformed data.

1.1 Data Transformation

Consider the original data vector u_i, i = 1 to D. We define the following transformation, depending on whether an entry u_i is positive or negative:

\tilde{u}_{2i-1} = u_i, \quad \tilde{u}_{2i} = 0, \quad \text{if } u_i > 0
\tilde{u}_{2i-1} = 0, \quad \tilde{u}_{2i} = -u_i, \quad \text{if } u_i \leq 0    (2)

For example, when D = 2 and u = [-5  3], the transformed data vector becomes \tilde{u} = [0  5  3  0].

1.2 Generalized Min-Max (GMM) Kernel

Given two data vectors u, v ∈ R^D, we first transform them into \tilde{u}, \tilde{v} ∈ R^{2D} according to (2). Then the generalized min-max (GMM) similarity is defined as

GMM(u, v) = \frac{\sum_{i=1}^{2D} \min(\tilde{u}_i, \tilde{v}_i)}{\sum_{i=1}^{2D} \max(\tilde{u}_i, \tilde{v}_i)}    (3)

We will show in Section 4 that GMM is indeed an effective measure of data similarity, through an extensive experimental study on kernel SVM classification.
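To make (2) and (3) concrete, here is a minimal Python sketch (the function names `transform` and `gmm` are ours, not from the paper):

```python
import numpy as np

def transform(u):
    """Map u in R^D to a nonnegative vector in R^{2D}, per Eq. (2):
    positive entries go to the odd coordinates, and the magnitudes of
    nonpositive entries go to the even coordinates."""
    u = np.asarray(u, dtype=float)
    ut = np.zeros(2 * len(u))
    ut[0::2] = np.where(u > 0, u, 0.0)    # u_tilde_{2i-1}
    ut[1::2] = np.where(u <= 0, -u, 0.0)  # u_tilde_{2i}
    return ut

def gmm(u, v):
    """Generalized min-max similarity, Eq. (3)."""
    ut, vt = transform(u), transform(v)
    return np.minimum(ut, vt).sum() / np.maximum(ut, vt).sum()

print(transform([-5, 3]))      # [0. 5. 3. 0.], matching the example above
print(gmm([-5, 3], [-5, 3]))   # 1.0 for identical vectors
```

Note that GMM(u, u) = 1 for any nonzero u, and GMM(u, v) = 0 whenever the transformed vectors have disjoint supports.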
Footnote 1: This transformation can be generalized by considering a "center vector" \mu_i, i = 1 to D, such that

\tilde{u}_{2i-1} = u_i - \mu_i, \quad \tilde{u}_{2i} = 0, \quad \text{if } u_i > \mu_i
\tilde{u}_{2i-1} = 0, \quad \tilde{u}_{2i} = -u_i + \mu_i, \quad \text{if } u_i \leq \mu_i

In this paper, we always use \mu_i = 0, ∀i. Note that the same center vector \mu should be used for all data vectors.

It is generally nontrivial to scale nonlinear kernels for large data [3]. In a sense, it is not practically meaningful to discuss nonlinear kernels without knowing how to compute them efficiently (e.g., via hashing). In this paper, we focus on the generalized consistent weighted sampling (GCWS).

1.3 Generalized Consistent Weighted Sampling (GCWS)

Algorithm 1 summarizes the "generalized consistent weighted sampling" (GCWS). Given two data vectors u and v, we transform them into nonnegative vectors \tilde{u} and \tilde{v} as in (2). We then apply the original "consistent weighted sampling" (CWS) [20, 9] to generate random tuples:

(i^*_{\tilde{u},j}, t^*_{\tilde{u},j}) and (i^*_{\tilde{v},j}, t^*_{\tilde{v},j}), \quad j = 1, 2, ..., k    (4)

where i^* ∈ [1, 2D] and t^* is unbounded. Following [20, 9], we have the basic probability result.

Theorem 1
Pr\left[ (i^*_{\tilde{u},j}, t^*_{\tilde{u},j}) = (i^*_{\tilde{v},j}, t^*_{\tilde{v},j}) \right] = GMM(u, v)    (5)

Algorithm 1: Generalized Consistent Weighted Sampling (GCWS). Note that we slightly re-write the expression for a_i compared to [9].

Input: data vector u = (u_i, i = 1 to D)
Transform: generate vector \tilde{u} in 2D dimensions by (2)
Output: consistent uniform sample (i^*, t^*)
For i from 1 to 2D:
    r_i \sim Gamma(2, 1), c_i \sim Gamma(2, 1), \beta_i \sim Uniform(0, 1)
    t_i \leftarrow \lfloor \log \tilde{u}_i / r_i + \beta_i \rfloor, \quad a_i \leftarrow \log(c_i) - r_i(t_i + 1 - \beta_i)
End For
i^* \leftarrow \arg\min_i a_i, \quad t^* \leftarrow t_{i^*}

With k samples, we can simply use the averaged indicator to estimate GMM(u, v).
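A direct transcription of Algorithm 1 might look as follows (a sketch; the helper name `gcws_hash` is ours, and we fix the seed per hash index so that different data vectors share the same randomness (r_i, c_i, β_i), which the collision probability in Theorem 1 requires):

```python
import numpy as np

def gcws_hash(u_tilde, seed):
    """One GCWS sample (i*, t*) for a nonnegative vector u_tilde,
    following Algorithm 1 (the log-space rewrite of CWS).  The seed
    fixes (r, c, beta), so vectors hashed with the same seed share
    the same randomness."""
    rng = np.random.default_rng(seed)
    n = len(u_tilde)
    r = rng.gamma(2.0, 1.0, size=n)
    c = rng.gamma(2.0, 1.0, size=n)
    beta = rng.uniform(0.0, 1.0, size=n)
    with np.errstate(divide="ignore"):         # log(0) = -inf is harmless here
        t = np.floor(np.log(u_tilde) / r + beta)
    a = np.log(c) - r * (t + 1.0 - beta)
    a[u_tilde == 0] = np.inf                   # zero coordinates never win
    i_star = int(np.argmin(a))
    return i_star, int(t[i_star])

# Collision frequency over k independent hashes estimates the
# min-max similarity (here 6/8 = 0.75 for these two toy vectors).
ut = np.array([0.0, 5.0, 3.0, 0.0])
vt = np.array([0.0, 4.0, 2.0, 0.0])
k = 2000
matches = sum(gcws_hash(ut, j) == gcws_hash(vt, j) for j in range(k))
print(matches / k)  # close to 0.75
```

The averaged indicator over the k seeds is exactly the estimator described above.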
By the property of the binomial distribution, we know the expectation (E) and variance (Var) are

E\left[ 1\{i^*_{\tilde{u},j} = i^*_{\tilde{v},j} \text{ and } t^*_{\tilde{u},j} = t^*_{\tilde{v},j}\} \right] = GMM(u, v),    (6)
Var\left[ 1\{i^*_{\tilde{u},j} = i^*_{\tilde{v},j} \text{ and } t^*_{\tilde{u},j} = t^*_{\tilde{v},j}\} \right] = GMM(u, v)\left(1 - GMM(u, v)\right)    (7)

The estimation variance, given k samples, will be \frac{1}{k} GMM (1 - GMM), which vanishes as GMM approaches 0 or 1, or as the sample size k → ∞.

1.4 0-bit GCWS for Linearizing GMM Kernel SVM

The so-called "0-bit" GCWS idea is that, based on intensive empirical observations [12], one can safely ignore t^* (which is unbounded) and simply use

Pr\left[ i^*_{\tilde{u},j} = i^*_{\tilde{v},j} \right] \approx GMM(u, v)    (8)

For each data vector u, we obtain k random samples i^*_{\tilde{u},j}, j = 1 to k. We store only the lowest b bits of i^*, based on the idea of [18]. We need to view those k integers as locations (of the nonzeros) instead of numerical values. For example, when b = 2, we should view i^* as a vector of length 2^b = 4. If i^* = 3, then we code it as [1 0 0 0]; if i^* = 0, we code it as [0 0 0 1]. We can concatenate all k such vectors into a binary vector of length 2^b × k, with exactly k 1's.

For linear methods, the computational cost is largely determined by the number of nonzeros in each data vector, i.e., the k in our case. For the other parameter b, we recommend b ≥ 4.

The natural competitor of the GMM kernel is the RBF (radial basis function) kernel, and the competitor of the GCWS hashing method is the RFF (random Fourier features) algorithm.

2 RBF Kernel and Normalized Random Fourier Features (NRFF)

The radial basis function (RBF) kernel is widely used in machine learning and beyond. In this study, for convenience (e.g., parameter tuning), we recommend the following version:

RBF(u, v; \gamma) = e^{-\gamma(1-\rho)}    (9)

where \rho = \rho(u, v) is the correlation defined in (1) and \gamma > 0 is a crucial tuning parameter.
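Concretely, (9) amounts to exponentiating the correlation from (1) (a minimal sketch; the function name is ours):

```python
import numpy as np

def rbf(u, v, gamma):
    """RBF kernel in the form of Eq. (9), driven by the correlation rho
    of Eq. (1) rather than the squared Euclidean distance."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    rho = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.exp(-gamma * (1.0 - rho))

print(rbf([1, 2], [1, 2], gamma=1.0))  # rho = 1, so the kernel is 1
print(rbf([1, 0], [0, 1], gamma=1.0))  # rho = 0, so the kernel is exp(-gamma)
```

For unit-norm data this version coincides (up to the parameterization of γ) with the usual Gaussian kernel, since ‖u − v‖² = 2(1 − ρ).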
Based on Bochner's Theorem [24], it is known [22] that, if we sample w \sim uniform(0, 2\pi) and r_i \sim N(0, 1) i.i.d., and let x = \sum_{i=1}^{D} u_i r_i, y = \sum_{i=1}^{D} v_i r_i, where \|u\|_2 = \|v\|_2 = 1, then we have

E\left[ \sqrt{2}\cos(\sqrt{\gamma}\, x + w)\, \sqrt{2}\cos(\sqrt{\gamma}\, y + w) \right] = e^{-\gamma(1-\rho)}    (10)

This provides a nice mechanism for linearizing the RBF kernel, and the RFF method has become popular in machine learning, computer vision, and beyond, e.g., [21, 27, 1, 7, 5, 28, 8, 25, 4, 23].

Theorem 2 Given x \sim N(0,1), y \sim N(0,1), E(xy) = \rho, and w \sim uniform(0, 2\pi), we have

E\left[ \sqrt{2}\cos(\sqrt{\gamma}\, x + w)\, \sqrt{2}\cos(\sqrt{\gamma}\, y + w) \right] = e^{-\gamma(1-\rho)}    (11)
E\left[ \cos(\sqrt{\gamma}\, x)\cos(\sqrt{\gamma}\, y) \right] = \frac{1}{2} e^{-\gamma(1-\rho)} + \frac{1}{2} e^{-\gamma(1+\rho)}    (12)
Var\left[ \sqrt{2}\cos(\sqrt{\gamma}\, x + w)\, \sqrt{2}\cos(\sqrt{\gamma}\, y + w) \right] = \frac{1}{2} + \frac{1}{2}\left(1 - e^{-2\gamma(1-\rho)}\right)^2    (13)

The proof of (13) can also be found in [26]. One can see that the variance of RFF can be large. Interestingly, the variance can be substantially reduced if we normalize the hashed data, a procedure which we call "normalized RFF (NRFF)". The theoretical results are presented in Theorem 3.

Theorem 3 Consider k i.i.d. samples (x_j, y_j, w_j) where x_j \sim N(0,1), y_j \sim N(0,1), E(x_j y_j) = \rho, w_j \sim uniform(0, 2\pi), j = 1, 2, ..., k. Let X_j = \sqrt{2}\cos(\sqrt{\gamma}\, x_j + w_j) and Y_j = \sqrt{2}\cos(\sqrt{\gamma}\, y_j + w_j). As k → ∞, the following asymptotic normality holds:

\sqrt{k}\left( \frac{\sum_{j=1}^{k} X_j Y_j}{\sqrt{\sum_{j=1}^{k} X_j^2}\, \sqrt{\sum_{j=1}^{k} Y_j^2}} - e^{-\gamma(1-\rho)} \right) \Rightarrow N\left(0, V_{n,\rho,\gamma}\right)    (14)

where

V_{n,\rho,\gamma} = V_{\rho,\gamma} - \frac{1}{4} e^{-2\gamma(1-\rho)}\left[ 3 - e^{-4\gamma(1-\rho)} \right]    (15)
V_{\rho,\gamma} = \frac{1}{2} + \frac{1}{2}\left(1 - e^{-2\gamma(1-\rho)}\right)^2    (16)

Obviously, V_{n,\rho,\gamma} < V_{\rho,\gamma} (in particular, V_{n,\rho,\gamma} = 0 at \rho = 1), i.e., the variance of the normalized RFF is (much) smaller than that of the original RFF. Figure 1 plots the ratio V_{n,\rho,\gamma}/V_{\rho,\gamma} to visualize the improvement due to normalization, which is most significant when \rho is close to 1.
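In code, RFF and NRFF differ only in how the hashed values are combined at the end: the original RFF divides the inner product by k, while NRFF normalizes each hashed vector to unit norm (a sketch with assumed toy data; all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, gamma = 100, 512, 1.0

# two unit-norm data vectors with substantial correlation rho
u = rng.normal(size=D); u /= np.linalg.norm(u)
v = u + 0.5 * rng.normal(size=D); v /= np.linalg.norm(v)
rho = u @ v

R = rng.normal(size=(D, k))              # shared Gaussian projections r_i
w = rng.uniform(0, 2 * np.pi, size=k)    # shared random phases
X = np.sqrt(2) * np.cos(np.sqrt(gamma) * (u @ R) + w)
Y = np.sqrt(2) * np.cos(np.sqrt(gamma) * (v @ R) + w)

rff = (X @ Y) / k                                         # original RFF
nrff = (X @ Y) / (np.linalg.norm(X) * np.linalg.norm(Y))  # normalized RFF
target = np.exp(-gamma * (1 - rho))
print(target, rff, nrff)  # both estimates are close to the RBF kernel
```

The quantity `nrff` is exactly the ratio appearing on the left-hand side of (14).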
Figure 1: The ratio V_{n,\rho,\gamma}/V_{\rho,\gamma} from Theorem 3, plotted against \rho ∈ [−1, 1] for \gamma ∈ {0.1, 0.2, 0.5, 1, 2, 5, 10}, for visualizing the improvement due to normalization.

Note that the theoretical results in Theorem 3 are asymptotic (i.e., for larger k). With k samples, the variance of the original RFF is exactly V_{\rho,\gamma}/k; however, the variance of the normalized RFF (NRFF) is V_{n,\rho,\gamma}/k + O(1/k^2). It is important to understand the behavior when k is not large. For this purpose, Figure 2 presents simulated mean square error (MSE) results for estimating the RBF kernel e^{-\gamma(1-\rho)}, confirming that (a) the improvement due to normalization can be substantial, and (b) the asymptotic variance formula (15) becomes accurate for merely k > 10.

Figure 2: A simulation study to verify the asymptotic theoretical results in Theorem 3. With k samples, we estimate the RBF kernel e^{-\gamma(1-\rho)} using both the original RFF and the normalized RFF (NRFF). With 10^5 repetitions at each k, we compute the empirical mean square error: MSE = Bias^2 + Var. Each panel presents the MSEs (solid curves) for a particular choice of (\rho, \gamma), with \rho ∈ {0.9, 0.5} and \gamma ∈ {0.1, 0.5, 1}, along with the theoretical variances V_{\rho,\gamma}/k and V_{n,\rho,\gamma}/k (dashed curves). The variance of the original RFF (upper curves, red if color is available) can be substantially larger than the MSE of the normalized RFF (lower curves, blue).
When k > 10, the normalized RFF provides an essentially unbiased estimate of the RBF kernel, and its empirical MSE matches the theoretical asymptotic variance.

Next, we attempt to compare RFF with GCWS. While ultimately we can rely on classification accuracy as a metric for performance, here we compare their variances (Var) relative to their expectations (E) in terms of Var/E^2, as shown in Figure 3. For GCWS, we know Var/E^2 = E(1-E)/E^2 = (1-E)/E. For the original RFF, we have Var/E^2 = \left[\frac{1}{2} + \frac{1}{2}(1 - E^2)^2\right]/E^2, etc.

Figure 3 shows that the relative variance of GCWS is substantially smaller than that of the original RFF and the normalized RFF (NRFF), especially when E is not large. In the very high similarity region (i.e., E → 1), the variances of both GCWS and NRFF approach zero.

Figure 3: Ratio of the variance over the squared expectation, denoted Var/E^2, for the convenience of comparing RFF/NRFF with GCWS. Smaller (lower) is better.

The results in Figure 3 provide one explanation for why, in the classification experiments reported later, GCWS typically needs substantially fewer samples than the normalized RFF in order to achieve similar classification accuracies. Note that for practical data, the similarities among most data points are usually small (i.e., small E), and hence it is not surprising that GCWS may perform substantially better. Also see Section 3 and Figure 4 for a comparison from the perspective of estimating RBF using GCWS, based on a model assumption.

In a sense, this drawback of RFF is expected, due to the nature of random projections. For example, as shown in [16, 17], the linear estimator of the correlation \rho using random projections has variance \frac{1+\rho^2}{k}, where k is the number of projections.
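Returning to Figure 3: the three relative-variance curves follow directly from the closed forms above, parameterized by the common expectation E = e^{-γ(1-ρ)} (a minimal sketch; function names are ours):

```python
import numpy as np

def rel_var_gcws(E):
    # Var/E^2 for GCWS: Bernoulli variance E(1-E) over E^2
    return (1 - E) / E

def rel_var_rff(E):
    # Var/E^2 for the original RFF, using V = 1/2 + 1/2 (1 - E^2)^2
    return (0.5 + 0.5 * (1 - E**2) ** 2) / E**2

def rel_var_nrff(E):
    # Var/E^2 for NRFF: V_n = V - (1/4) E^2 (3 - E^4), from Eq. (15)
    V = 0.5 + 0.5 * (1 - E**2) ** 2
    return (V - 0.25 * E**2 * (3 - E**4)) / E**2

for E in (0.2, 0.5, 0.9):
    print(E, rel_var_gcws(E), rel_var_rff(E), rel_var_nrff(E))
```

For every E < 1 the GCWS curve lies below the NRFF curve, which in turn lies below the RFF curve, matching Figure 3.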
In order to make the variance small, one will have to use many projections (i.e., large k).

Proof of Theorem 2: The following three integrals will be useful in our proof:

\int_{-\infty}^{\infty} \cos(cx)\, e^{-x^2/2}\, dx = \sqrt{2\pi}\, e^{-c^2/2}

\int_{-\infty}^{\infty} \cos(c_1 x)\cos(c_2 x)\, e^{-x^2/2}\, dx = \frac{1}{2}\int_{-\infty}^{\infty} \left[\cos((c_1+c_2)x) + \cos((c_1-c_2)x)\right] e^{-x^2/2}\, dx = \frac{\sqrt{2\pi}}{2}\left[e^{-(c_1+c_2)^2/2} + e^{-(c_1-c_2)^2/2}\right]

\int_{-\infty}^{\infty} \sin(c_1 x)\sin(c_2 x)\, e^{-x^2/2}\, dx = \frac{\sqrt{2\pi}}{2}\left[e^{-(c_1-c_2)^2/2} - e^{-(c_1+c_2)^2/2}\right]

Firstly, we consider integers b_1, b_2 = 1, 2, 3, ..., and evaluate the following general integral:

E\left[\cos(c_1 x + b_1 w)\cos(c_2 y + b_2 w)\right]
= \frac{1}{2\pi}\int_0^{2\pi} E\left[\cos(c_1 x + b_1 t)\cos(c_2 y + b_2 t)\right] dt
= \frac{1}{2\pi}\int_0^{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \cos(c_1 x + b_1 t)\cos(c_2 y + b_2 t)\, \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{x^2+y^2-2\rho xy}{2(1-\rho^2)}}\, dx\, dy\, dt

Completing the square in y (by adding and subtracting \rho^2 x^2 in the exponent), substituting y \to \rho x + \sqrt{1-\rho^2}\, y, and integrating out y using the first integral above (the \sin component integrates to zero), we obtain

= \frac{1}{2\pi}\frac{1}{\sqrt{2\pi}}\, e^{-\frac{c_2^2(1-\rho^2)}{2}} \int_0^{2\pi}\int_{-\infty}^{\infty} e^{-x^2/2}\cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dx\, dt

Note that

\int_0^{2\pi} \cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dt
= \int_0^{2\pi} \cos(c_1 x)\cos(b_1 t)\cos(c_2\rho x)\cos(b_2 t)\, dt + \int_0^{2\pi} \sin(c_1 x)\sin(b_1 t)\sin(c_2\rho x)\sin(b_2 t)\, dt
- \int_0^{2\pi} \cos(c_1 x)\cos(b_1 t)\sin(c_2\rho x)\sin(b_2 t)\, dt - \int_0^{2\pi} \sin(c_1 x)\sin(b_1 t)\cos(c_2\rho x)\cos(b_2 t)\, dt

When b_1 \neq b_2, we have

\int_0^{2\pi} \cos(b_1 t)\cos(b_2 t)\, dt = \frac{1}{2}\int_0^{2\pi} \left[\cos((b_1-b_2)t) + \cos((b_1+b_2)t)\right] dt = 0
\int_0^{2\pi} \sin(b_1 t)\sin(b_2 t)\, dt = \frac{1}{2}\int_0^{2\pi} \left[\cos((b_1-b_2)t) - \cos((b_1+b_2)t)\right] dt = 0

If b_1 = b_2, then

\int_0^{2\pi} \cos(b_1 t)\cos(b_2 t)\, dt = \int_0^{2\pi} \sin(b_1 t)\sin(b_2 t)\, dt = \pi

In addition, for any b_1, b_2 = 1, 2, 3, ..., we always have

\int_0^{2\pi} \sin(b_1 t)\cos(b_2 t)\, dt = \frac{1}{2}\int_0^{2\pi} \left[\sin((b_1-b_2)t) + \sin((b_1+b_2)t)\right] dt = 0

Thus, only when b_1 = b_2 do we have

\int_0^{2\pi} \cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dt = \pi\cos(c_1 x)\cos(c_2\rho x) + \pi\sin(c_1 x)\sin(c_2\rho x) = \pi\cos((c_1 - c_2\rho)x)

Otherwise, \int_0^{2\pi} \cos(c_1 x + b_1 t)\cos(c_2\rho x + b_2 t)\, dt = 0. Therefore, when b_1 = b_2, we have

E\left[\cos(c_1 x + b_1 w)\cos(c_2 y + b_2 w)\right]
= \frac{1}{2\pi}\frac{1}{\sqrt{2\pi}}\, e^{-\frac{c_2^2(1-\rho^2)}{2}} \int_{-\infty}^{\infty} e^{-x^2/2}\, \pi\cos((c_1 - c_2\rho)x)\, dx
= \frac{1}{2\pi}\frac{1}{\sqrt{2\pi}}\, e^{-\frac{c_2^2(1-\rho^2)}{2}}\, \pi\sqrt{2\pi}\, e^{-(c_1-c_2\rho)^2/2}
= \frac{1}{2}\, e^{-\frac{c_1^2 + c_2^2 - 2c_1 c_2\rho}{2}}
= \frac{1}{2}\, e^{-c^2(1-\rho)}, \quad \text{when } c_1 = c_2 = c

This completes the proof of the first moment.
Next, using the following fact,

E\left[\cos(2cx + 2w)\right] = \frac{1}{2\pi}\int_0^{2\pi}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\cos(2cx + 2t)\, e^{-x^2/2}\, dx\, dt = \frac{1}{2\pi}\, e^{-2c^2}\int_0^{2\pi}\cos(2t)\, dt = 0

(expanding \cos(2cx + 2t) and noting that the \sin(2cx) component integrates to zero over x), we are ready to compute the second moment:

E\left[\cos(cx+w)\cos(cy+w)\right]^2 = \frac{1}{4} E\left[\cos(2cx+2w)\cos(2cy+2w) + \cos(2cx+2w) + \cos(2cy+2w)\right] + \frac{1}{4}
= \frac{1}{4} E\left[\cos(2cx+2w)\cos(2cy+2w)\right] + \frac{1}{4}
= \frac{1}{8} e^{-4c^2(1-\rho)} + \frac{1}{4}

and the variance

Var\left[\cos(cx+w)\cos(cy+w)\right] = \frac{1}{8} e^{-4c^2(1-\rho)} + \frac{1}{4} - \frac{1}{4} e^{-2c^2(1-\rho)}

Finally, we prove the first moment without the "w" random variable. By the same completion of the square in y and integration over y,

E\left[\cos(cx)\cos(cy)\right]
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \cos(cx)\cos(cy)\, \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{x^2+y^2-2\rho xy}{2(1-\rho^2)}}\, dx\, dy
= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{c^2(1-\rho^2)}{2}} \int_{-\infty}^{\infty} e^{-x^2/2}\cos(cx)\cos(c\rho x)\, dx
= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{c^2(1-\rho^2)}{2}}\, \frac{\sqrt{2\pi}}{2}\left[e^{-\frac{c^2(1-\rho)^2}{2}} + e^{-\frac{c^2(1+\rho)^2}{2}}\right]
= \frac{1}{2}\, e^{-c^2(1-\rho)} + \frac{1}{2}\, e^{-c^2(1+\rho)}

This completes the proof of Theorem 2.

Proof of Theorem 3: We will use some of the results from the proof of Theorem 2. Define

X_j = \sqrt{2}\cos(\sqrt{\gamma}\, x_j + w_j), \quad Y_j = \sqrt{2}\cos(\sqrt{\gamma}\, y_j + w_j), \quad Z_k = \frac{\sum_{j=1}^{k} X_j Y_j}{\sqrt{\sum_{j=1}^{k} X_j^2}\, \sqrt{\sum_{j=1}^{k} Y_j^2}}

From Theorem 2, it is easy to see that, as k → ∞, we have

\frac{1}{k}\sum_{j=1}^{k} X_j^2 \to E\left[X_j^2\right] = e^{-\gamma(1-1)} = 1 \text{ a.s.}, \qquad \frac{1}{k}\sum_{j=1}^{k} Y_j^2 \to 1 \text{ a.s.},

Z_k = \frac{\frac{1}{k}\sum_j X_j Y_j}{\sqrt{\frac{1}{k}\sum_j X_j^2}\, \sqrt{\frac{1}{k}\sum_j Y_j^2}} \to e^{-\gamma(1-\rho)} = Z_\infty \text{ a.s.}
We express the deviation Z_k - Z_\infty as

Z_k - Z_\infty = \frac{\frac{1}{k}\sum_j X_j Y_j - Z_\infty}{\sqrt{\frac{1}{k}\sum_j X_j^2}\, \sqrt{\frac{1}{k}\sum_j Y_j^2}} + Z_\infty\, \frac{1 - \sqrt{\frac{1}{k}\sum_j X_j^2}\, \sqrt{\frac{1}{k}\sum_j Y_j^2}}{\sqrt{\frac{1}{k}\sum_j X_j^2}\, \sqrt{\frac{1}{k}\sum_j Y_j^2}}
= \left(\frac{1}{k}\sum_{j=1}^{k} X_j Y_j - Z_\infty\right) + Z_\infty\, \frac{1 - \frac{1}{k}\sum_j X_j^2 \cdot \frac{1}{k}\sum_j Y_j^2}{2} + O_P(1/k)
= \left(\frac{1}{k}\sum_{j=1}^{k} X_j Y_j - Z_\infty\right) + Z_\infty\, \frac{1 - \frac{1}{k}\sum_j X_j^2}{2} + Z_\infty\, \frac{1 - \frac{1}{k}\sum_j Y_j^2}{2} + O_P(1/k)

Note that if a ≈ 1 and b ≈ 1, then 1 - ab = 1 - (1-(1-a))(1-(1-b)) = (1-a) + (1-b) - (1-a)(1-b), and we can ignore the higher-order term. Therefore, to analyze the asymptotic variance, it suffices to study the following expectation:

E\left[XY - Z_\infty + Z_\infty\frac{1 - X^2}{2} + Z_\infty\frac{1 - Y^2}{2}\right]^2 = E\left[XY - Z_\infty\frac{X^2 + Y^2}{2}\right]^2
= E\left[X^2 Y^2\right] + Z_\infty^2\, E\left[X^4 + Y^4 + 2X^2 Y^2\right]/4 - Z_\infty E\left[X^3 Y\right] - Z_\infty E\left[X Y^3\right]

which can be obtained from the results in the proof of Theorem 2. In particular, if b_1 = b_2, then

E\left[\cos(c_1 x + b_1 w)\cos(c_2 y + b_2 w)\right] = \frac{1}{2}\, e^{-\frac{c_1^2 + c_2^2 - 2c_1 c_2\rho}{2}}

Otherwise, E\left[\cos(c_1 x + b_1 w)\cos(c_2 y + b_2 w)\right] = 0.
We can now compute (with \gamma = c^2)

E\left[\cos(cx+w)^3\cos(cy+w)\right] = E\left[\frac{1}{4}\cos(3(cx+w))\cos(cy+w) + \frac{3}{4}\cos(cx+w)\cos(cy+w)\right] = \frac{3}{8}\, e^{-c^2(1-\rho)}
E\left[\cos(cx+w)\cos(cy+w)\right]^2 = \frac{1}{8}\, e^{-4c^2(1-\rho)} + \frac{1}{4}
E\left[\cos(cx+w)^4\right] = \frac{1}{8} + \frac{1}{4} = \frac{3}{8}

Therefore,

V_{n,\rho,\gamma} = E\left[XY - Z_\infty + Z_\infty\frac{1-X^2}{2} + Z_\infty\frac{1-Y^2}{2}\right]^2
= E\left[X^2 Y^2\right] + Z_\infty^2\, E\left[X^4 + Y^4 + 2X^2 Y^2\right]/4 - Z_\infty E\left[X^3 Y\right] - Z_\infty E\left[X Y^3\right]
= \left[\frac{1}{2}e^{-4c^2(1-\rho)} + 1\right] + e^{-2c^2(1-\rho)}\left[\frac{3}{8} + \frac{3}{8} + \frac{1}{4}e^{-4c^2(1-\rho)} + \frac{1}{2}\right] - e^{-c^2(1-\rho)}\left[\frac{3}{2}e^{-c^2(1-\rho)} + \frac{3}{2}e^{-c^2(1-\rho)}\right]
= \frac{1}{2}e^{-4c^2(1-\rho)} + 1 + e^{-2c^2(1-\rho)}\left[\frac{5}{4} + \frac{1}{4}e^{-4c^2(1-\rho)}\right] - 3e^{-2c^2(1-\rho)}
= \frac{1}{2}e^{-4c^2(1-\rho)} + 1 + \frac{1}{4}e^{-6c^2(1-\rho)} - \frac{7}{4}e^{-2c^2(1-\rho)}
= V_{\rho,\gamma} - \frac{1}{4}e^{-2c^2(1-\rho)}\left[3 - e^{-4c^2(1-\rho)}\right]

where V_{\rho,\gamma} is the corresponding variance factor without normalization:

V_{\rho,\gamma} = \frac{1}{2} + \frac{1}{2}\left(1 - e^{-2c^2(1-\rho)}\right)^2

This completes the proof of Theorem 3.

3 Another Comparison Based on the Asymptotics of GMM

As proved in a technical report following this paper [19], under a mild model assumption, as the dimension D becomes large, the GMM kernel converges to a function of the true data correlation \rho:

GMM \to \frac{1 - \sqrt{(1-\rho)/2}}{1 + \sqrt{(1-\rho)/2}} = g    (17)

The convergence holds almost surely for data with bounded first moment. Using the expression for g, we can express the RBF kernel e^{-\gamma(1-\rho)} in terms of g:

\rho = 1 - 2\left(\frac{1-g}{1+g}\right)^2, \qquad e^{-\gamma(1-\rho)} = e^{-2\gamma\left(\frac{1-g}{1+g}\right)^2}    (18)

For the convenience of conducting the theoretical analysis, we assume GMM = \frac{1 - \sqrt{(1-\rho)/2}}{1 + \sqrt{(1-\rho)/2}} = g exactly, instead of asymptotically. Then we have another estimator of the RBF kernel from GCWS. Note that with k hashes, the estimate of GMM follows a binomial distribution, binomial(k, g).

Theorem 4 Assume g = \frac{1 - \sqrt{(1-\rho)/2}}{1 + \sqrt{(1-\rho)/2}} and X_i, i = 1, ..., k, i.i.d. Bernoulli(g), so that \sum_{i=1}^{k} X_i \sim binomial(k, g).
Then, denoting \bar{X} = \frac{1}{k}\sum_{i=1}^{k} X_i, we have

E\left[e^{-2\gamma\left(\frac{1-\bar{X}}{1+\bar{X}}\right)^2}\right] = e^{-\gamma(1-\rho)} + O\left(\frac{1}{k}\right)    (19)

Var\left[e^{-2\gamma\left(\frac{1-\bar{X}}{1+\bar{X}}\right)^2}\right] = \frac{V_{g,\gamma}}{k} + O\left(\frac{1}{k^2}\right)    (20)

where

V_{g,\gamma} = e^{-2\gamma(1-\rho)}\, \frac{g(1-g)^3}{(1+g)^6}\, 64\gamma^2    (21)

Proof of Theorem 4: For an asymptotic analysis with large k, it suffices to treat Z = \frac{1-\bar{X}}{1+\bar{X}} as a normal random variable, whose mean and variance can be calculated (by the delta method) to be \mu = \frac{1-g}{1+g} and \sigma^2 = \frac{1}{k}\frac{4g(1-g)}{(1+g)^4}. Thus, it suffices to compute

E\left[e^{Z^2 t}\right] = \int_{-\infty}^{\infty} e^{x^2 t}\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(1-2\sigma^2 t)x^2 - 2\mu x + \mu^2}{2\sigma^2}}\, dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2 - 2\mu x/c^2 + \mu^2/c^2}{2\sigma^2/c^2}}\, dx, \quad \text{where } c^2 = 1 - 2\sigma^2 t
= \frac{1}{c}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma/c}\, e^{-\frac{(x - \mu/c^2)^2 - \mu^2/c^4 + \mu^2/c^2}{2\sigma^2/c^2}}\, dx
= \frac{1}{c}\, e^{\frac{\mu^2(1-c^2)}{2\sigma^2 c^2}} = \frac{1}{c}\, e^{\frac{\mu^2 t}{c^2}} = \frac{1}{\sqrt{1-2\sigma^2 t}}\, e^{\frac{\mu^2 t}{1-2\sigma^2 t}}

from which we can compute the variance (letting \sigma^2 = \frac{\lambda^2}{k}):

Var\left[e^{Z^2 t}\right] = E\left[e^{Z^2 \cdot 2t}\right] - \left(E\left[e^{Z^2 t}\right]\right)^2
= \frac{1}{\sqrt{1-4\sigma^2 t}}\, e^{\frac{2\mu^2 t}{1-4\sigma^2 t}} - \frac{1}{1-2\sigma^2 t}\, e^{\frac{2\mu^2 t}{1-2\sigma^2 t}}
= \left(1 + \frac{2\lambda^2 t}{k} + O\left(\frac{1}{k^2}\right)\right) e^{2\mu^2 t\left(1 + \frac{4\lambda^2 t}{k} + O\left(\frac{1}{k^2}\right)\right)} - \left(1 + \frac{2\lambda^2 t}{k} + O\left(\frac{1}{k^2}\right)\right) e^{2\mu^2 t\left(1 + \frac{2\lambda^2 t}{k} + O\left(\frac{1}{k^2}\right)\right)}
= \left(1 + O\left(\frac{1}{k}\right)\right) e^{2\mu^2 t}\left(1 + \frac{8\mu^2\lambda^2 t^2}{k} + O\left(\frac{1}{k^2}\right)\right) - \left(1 + O\left(\frac{1}{k}\right)\right) e^{2\mu^2 t}\left(1 + \frac{4\mu^2\lambda^2 t^2}{k} + O\left(\frac{1}{k^2}\right)\right)
= \frac{4\mu^2\lambda^2 t^2}{k}\, e^{2\mu^2 t} + O\left(\frac{1}{k^2}\right)

Plugging in t = -2\gamma, \mu = \frac{1-g}{1+g}, and \lambda^2 = \frac{4g(1-g)}{(1+g)^4} yields

Var\left[e^{-2\gamma\left(\frac{1-\bar{X}}{1+\bar{X}}\right)^2}\right] = \frac{64\gamma^2}{k}\, \frac{g(1-g)^3}{(1+g)^6}\, e^{-4\gamma\left(\frac{1-g}{1+g}\right)^2} + O\left(\frac{1}{k^2}\right) = \frac{64\gamma^2}{k}\, \frac{g(1-g)^3}{(1+g)^6}\, e^{-2\gamma(1-\rho)} + O\left(\frac{1}{k^2}\right)

This theoretical result provides a direct comparison of GCWS with NRFF for estimating the same object, by visualizing the variance ratio V_{n,\rho,\gamma}/V_{g,\gamma}, using the results from Theorem 3.
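The variance ratio plotted in Figure 4 can be computed directly from Theorems 3 and 4 (a minimal sketch; function names are ours):

```python
import numpy as np

def V_n(rho, gamma):
    # NRFF asymptotic variance factor, Eqs. (15)-(16) of Theorem 3
    e2 = np.exp(-2 * gamma * (1 - rho))
    V = 0.5 + 0.5 * (1 - e2) ** 2
    return V - 0.25 * e2 * (3 - e2 ** 2)

def V_g(rho, gamma):
    # variance factor of the GCWS-based RBF estimator, Eq. (21) of Theorem 4
    g = (1 - np.sqrt((1 - rho) / 2)) / (1 + np.sqrt((1 - rho) / 2))
    return (np.exp(-2 * gamma * (1 - rho))
            * g * (1 - g) ** 3 / (1 + g) ** 6 * 64 * gamma ** 2)

for rho in (0.0, 0.5, 0.9):
    print(rho, V_n(rho, 1.0) / V_g(rho, 1.0))
```

For moderate similarities the ratio exceeds 1 (GCWS wins); as ρ → 1 the ratio can dip below 1, consistent with the "very high similarity region" caveat in the text.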
As shown in Figure 4, for estimating the RBF kernel, the variance of GCWS is substantially smaller than the variance of NRFF, except in the very high similarity region (depending on \gamma). At high similarity, the variances of both methods approach zero. This provides another explanation for the superb empirical performance of GCWS compared to NRFF, as will be reported later in the paper.

Figure 4: The variance ratio V_{n,\rho,\gamma}/V_{g,\gamma}, plotted against \rho for \gamma ∈ {0.1, 1, 2, 5, 10}, provides another comparison of GCWS with NRFF. V_{g,\gamma} is derived in Theorem 4 and V_{n,\rho,\gamma} is derived in Theorem 3. The ratios are significantly larger than 1, except in the very high similarity region (where the variances of both methods are close to zero).

4 An Experimental Study on Kernel SVMs

Table 1 lists datasets from the UCI repository. Table 2 presents datasets from the LIBSVM website as well as datasets which are fairly large. Table 3 contains datasets used for evaluating deep learning and trees [10, 11]. Except for the relatively large datasets in Table 2, we report the classification accuracies for the linear SVM, kernel SVM with RBF, and kernel SVM with GMM, at the best l2-regularization C values. More detailed results (for all regularization C values) are available in Figures 5, 6, 7, and 8. To ensure repeatability, we use the LIBSVM pre-computed kernel functionality. This also means we cannot (easily) test nonlinear kernels on larger datasets.

For the RBF kernel, we exhaustively experimented with 58 different values of \gamma ∈ {0.001, 0.01, 0.1:0.1:2, 2.5, 3:1:20, 25:5:50, 60:10:100, 120, 150, 200, 300, 500, 1000}. Basically, Tables 1, 2, and 3 report the best RBF results among all \gamma and C values in our experiments.
The classification results indicate that, on these datasets, kernel (GMM and RBF) SVM classifiers improve over linear classifiers substantially. For more than half of the datasets, the GMM kernel (which has no tuning parameter) outperforms the best-tuned RBF kernel. For a small number of datasets (e.g., "SEMG1"), even though the RBF kernel performs better, we will show in Section 5 that GCWS hashing can still be substantially better than NRFF hashing.

Table 1: Public (UCI) classification datasets and l2-regularized kernel SVM results. We report the test classification accuracies for the linear kernel, the best-tuned RBF kernel (and the best \gamma), and the GMM kernel, at their individually-best SVM regularization C values.

Dataset              #train  #test  #dim   linear  RBF (γ)       GMM
Car                  864     864    6      71.53   94.91 (100)   98.96
Covertype25k         25000   25000  54     62.64   82.66 (90)    82.65
CTG                  1063    1063   35     60.59   89.75 (0.1)   88.81
DailySports          4560    4560   5625   77.70   97.61 (4)     99.61
Dexter               300     300    19999  92.67   93.00 (0.01)  94.00
Gesture              4937    4936   32     37.22   61.06 (9)     65.50
ImageSeg             210     2100   19     83.81   91.38 (0.4)   95.05
Isolet2k             2000    5797   617    93.95   95.55 (3)     95.53
MSD20k               20000   20000  90     66.72   68.07 (0.1)   71.05
MHealth20k           20000   20000  23     72.62   82.65 (0.1)   85.28
Magic                9150    9150   10     78.04   84.43 (0.8)   87.02
Musk                 3299    3299   166    95.09   99.33 (1.2)   99.24
PageBlocks           2737    2726   10     95.87   97.08 (1.2)   96.56
Parkinson            520     520    26     61.15   66.73 (1.9)   69.81
PAMAP101             20000   20000  51     76.86   96.68 (15)    98.91
PAMAP102             20000   20000  51     81.22   95.67 (1.1)   98.78
PAMAP103             20000   20000  51     85.54   97.89 (19)    99.69
PAMAP104             20000   20000  51     84.03   97.32 (19)    99.30
PAMAP105             20000   20000  51     79.43   97.34 (18)    99.22
RobotNavi            2728    2728   24     69.83   90.69 (10)    96.85
Satimage             4435    2000   36     72.45   85.20 (200)   90.40
SEMG1                900     900    3000   26.00   43.56 (4)     41.00
SEMG2                1800    1800   2500   19.28   29.00 (6)     54.00
Sensorless           29255   29254  48     61.53   93.01 (0.4)   99.39
Shuttle500           500     14500  9      91.81   99.52 (1.6)   99.65
SkinSeg10k           10000   10000  3      93.36   99.74 (120)   99.81
SpamBase             2301    2300   57     85.91   92.57 (0.2)   94.17
Splice               1000    2175   60     85.10   90.02 (15)    95.22
Thyroid2k            2000    5200   21     94.90   97.00 (2.5)   98.40
Urban                168     507    147    62.52   51.48 (0.01)  66.08
Vowel                264     264    10     39.39   94.70 (45)    96.97
YoutubeAudio10k      10000   11930  2000   41.35   48.63 (2)     50.59
YoutubeHOG10k        10000   11930  647    62.77   66.20 (0.5)   68.63
YoutubeMotion10k     10000   11930  64     26.24   28.81 (19)    31.95
YoutubeSaiBoxes10k   10000   11930  7168   46.97   49.31 (1.1)   51.28
YoutubeSpectrum10k   10000   11930  1024   26.81   33.54 (4)     39.23

Table 2: Datasets in groups 1 and 3 are from the LIBSVM website. Datasets in group 2 are from the UCI repository. Datasets in groups 2 and 3 are too large for the LIBSVM pre-computed kernel functionality and are thus only used for testing hashing methods.

Group  Dataset        #train   #test    #dim    linear  RBF (γ)      GMM
1      Letter         15000    5000     16      61.66   97.44 (11)   97.26
1      Protein        17766    6621     357     69.14   70.32 (4)    70.64
1      SensIT20k      20000    19705    100     80.42   83.15 (0.1)  84.57
1      Webspam20k     20000    60000    254     93.00   97.99 (35)   97.88
2      PAMAP101Large  186,581  186,580  51      79.19   --           --
2      PAMAP105Large  185,548  185,548  51      83.35   --           --
3      IJCNN          49990    91701    22      92.56   --           --
3      RCV1           338,699  338,700  47,236  97.66   --           --
3      SensIT         78,823   19,705   100     80.55   --           --
3      Webspam        175,000  175,000  254     93.31   --           --

Table 3: Datasets from [10, 11]. See the technical report [15] on "tunable GMM kernels" for substantially improved results, obtained by introducing tuning parameters in the GMM kernel.
Dataset    #train  #test  #dim  linear  RBF (γ)     GMM
M-Basic    12000   50000  784   89.98   97.21 (5)   96.20
M-Image    12000   50000  784   70.71   77.84 (16)  80.85
M-Noise1   10000   4000   784   60.28   66.83 (10)  71.38
M-Noise2   10000   4000   784   62.05   69.15 (11)  72.43
M-Noise3   10000   4000   784   65.15   71.68 (11)  73.55
M-Noise4   10000   4000   784   68.38   75.33 (14)  76.05
M-Noise5   10000   4000   784   72.25   78.70 (12)  79.03
M-Noise6   10000   4000   784   78.73   85.33 (15)  84.23
M-Rand     12000   50000  784   78.90   85.39 (12)  84.22
M-Rotate   12000   50000  784   47.99   89.68 (5)   84.76
M-RotImg   12000   50000  784   31.44   45.84 (18)  40.98

Figure 5: Test classification accuracies using kernel SVMs, for the datasets Car, Covertype25k, CTG, DailySports, Dexter, Gesture, ImageSeg, Isolet2k, MSD20k, MHealth20k, Magic, and Musk. Both the GMM kernel and the RBF kernel substantially improve over linear SVM. C is the l2-regularization parameter of the SVM.
For the RBF kernel, we report the result at the best γ value for every C value.

[Figure 6 image: twelve accuracy-vs-C panels (PageBlocks, Parkinson, PAMAP101, PAMAP102, PAMAP103, PAMAP104, PAMAP105, RobotNavi, Satimage, SEMG1, SEMG2, Sensorless), each comparing GMM, RBF, and linear.]

Figure 6: Test classification accuracies using kernel SVMs. Both the GMM kernel and the RBF kernel substantially improve linear SVM. C is the l2-regularization parameter of SVM. For the RBF kernel, we report the result at the best γ value for every C value.
[Figure 7 image: twelve accuracy-vs-C panels (Shuttle500, SkinSeg10k, SpamBase, Splice, Thyroid2k, Urban, Vowel, YoutubeAudio10k, YoutubeHOG10k, YoutubeMotion10k, YoutubeSaiBoxes10k, YoutubeSpectrum10k), each comparing GMM, RBF, and linear.]

Figure 7: Test classification accuracies using kernel SVMs. Both the GMM kernel and the RBF kernel substantially improve linear SVM. C is the l2-regularization parameter of SVM. For the RBF kernel, we report the result at the best γ value for every C value.
[Figure 8 image: fifteen accuracy-vs-C panels (Letter, Protein, SensIT20k, Webspam20k, M-Basic, M-Image, M-Noise1 through M-Noise6, M-Rand, M-Rotate, M-RotImg), each comparing GMM, RBF, and linear.]

Figure 8: Test classification accuracies using kernel SVMs. Both the GMM kernel and the RBF kernel substantially improve linear SVM. C is the l2-regularization parameter of SVM. For the RBF kernel, we report the result at the best γ value for every C value.

For the datasets in Table 3, since [10] also conducted experiments on the RBF kernel, the polynomial kernel, and neural nets, we assemble the (error rate) results in Figure 9 and Table 4.
[Figure 9 image: error rate (%) for datasets 1 through 6, with curves for SVM, RBF, and GMM.]

Figure 9: Error rates on 6 datasets: M-Noise1 to M-Noise6 as in Table 3. In this figure, the curve labeled "SVM" represents the results of RBF kernel SVM conducted by [10], while the curve labeled "RBF" presents our own experiments. The small discrepancies might be caused by the fact that we always use normalized data (i.e., ρ).

Table 4: Summary of test error rates of various algorithms on other datasets used in [10, 11]. Results in group 1 are reported by [10] for the RBF kernel, polynomial kernel, and neural nets. Results in group 2 are from our own experiments. Also, see the technical report [15] on "tunable GMM kernels" for substantially improved results, obtained by introducing tuning parameters in the GMM kernel.

Group Method M-Basic M-Rotate M-Image M-Rand M-RotImg
1 SVM-RBF 3.05% 11.11% 22.61% 14.58% 55.18%
1 SVM-POLY 3.69% 15.42% 24.01% 16.62% 56.41%
1 NNET 4.69% 18.11% 27.41% 20.04% 62.16%
2 Linear 10.02% 52.01% 29.29% 21.10% 68.56%
2 RBF 2.79% 10.30% 22.16% 14.61% 54.16%
2 GMM 3.80% 15.24% 19.15% 15.78% 59.02%

5 Hashing for Linearizing Nonlinear Kernels

It is known that a straightforward implementation of nonlinear kernels can be difficult for large datasets [3]. For example, for a small dataset with merely 100,000 data points, the 100,000 × 100,000 kernel matrix has 10^10 entries. In practice, being able to linearize nonlinear kernels is therefore very beneficial, as it allows us to easily apply efficient linear algorithms, especially online learning [2]. Randomization (hashing) is a popular tool for kernel linearization. In the introduction, we have explained how to linearize both the RBF kernel and the GMM kernel.
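To make the GCWS linearization concrete, the following is a minimal sketch (an illustrative reimplementation, not the exact code used in our experiments): the input vector is split into its positive and negative parts to form a nonnegative expansion, and consistent weighted sampling [9, 20] is applied to that expansion to produce k samples (i*, t*). The collision probability of these pairs across two vectors equals their GMM similarity.

```python
import numpy as np

def gcws_hash(x, k, seed=0):
    """Generalized consistent weighted sampling: k samples (i*, t*)
    for a vector x that may contain negative entries."""
    # Nonnegative expansion: positive parts, then negated negative parts.
    s = np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)])
    idx = np.flatnonzero(s)
    # CWS randomness (Ioffe, 2010): r, c ~ Gamma(2, 1), beta ~ Uniform(0, 1).
    rng = np.random.default_rng(seed)
    r = rng.gamma(2.0, 1.0, size=(k, s.size))
    c = rng.gamma(2.0, 1.0, size=(k, s.size))
    beta = rng.uniform(0.0, 1.0, size=(k, s.size))
    t = np.floor(np.log(s[idx]) / r[:, idx] + beta[:, idx])
    y = np.exp(r[:, idx] * (t - beta[:, idx]))
    a = c[:, idx] / (y * np.exp(r[:, idx]))
    j = np.argmin(a, axis=1)          # one sampled coordinate per hash
    rows = np.arange(k)
    return idx[j], t[rows, j]         # (i*, t*) for each of the k hashes
```

The essential detail is that the random arrays (r, c, beta) must be shared across all data vectors, enforced here by the fixed seed; in practice one would precompute them once for the whole dataset.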
From a practitioner's perspective, while the kernel classification results in Tables 1, 2, and 3 are informative, they are not sufficient for guiding the choice of kernels. For example, as we will show, for some datasets, even though the RBF kernel outperforms the GMM kernel, the linearization algorithm (i.e., the normalized RFF) requires substantially more samples (i.e., larger k). Note that in our SVM experiments, we always normalize the input features to the unit l2 norm (i.e., we always use NRFF instead of RFF).

We will report detailed experimental results on 6 datasets. As shown in Table 5, on the first two datasets, the original RBF and GMM kernels perform similarly; in the second group, the GMM kernel noticeably outperforms the RBF kernel; in the last group, the RBF kernel noticeably outperforms the GMM kernel. We will show that, on all these 6 datasets, GCWS hashing is substantially more accurate than NRFF hashing at the same sample size (k). We will then present less detailed results on other datasets.

Table 5: 6 datasets used for presenting detailed experimental results on GCWS and NRFF.

Group Dataset # train # test # dim linear RBF (γ) GMM
1 Letter 15000 5000 16 61.66 97.44 (11) 97.26
1 Webspam20k 20000 60000 254 93.00 97.99 (35) 97.88
2 DailySports 4560 4560 5625 77.70 97.61 (4) 99.61
2 RobotNavi 2728 2728 24 69.83 90.69 (10) 96.85
3 SEMG1 900 900 3000 26.00 43.56 (4) 41.00
3 M-Rotate 12000 50000 784 47.99 89.68 (5) 84.76

Figure 10 reports the test classification accuracies on the Letter dataset, for both the linearized GMM kernel with GCWS and the linearized RBF kernel (at the best γ) with NRFF, using LIBLINEAR. From Table 5, we can see that the original RBF kernel slightly outperforms the GMM kernel.
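As a reminder of the NRFF baseline used throughout this section, here is a minimal sketch, assuming the convention exp(-γ‖x − y‖²/2) on unit-norm inputs (which equals exp(-γ(1 − ρ))); the exact parameterization in our experiments may differ by a constant factor. The normalization of each hashed vector to unit l2 norm is the only difference from plain RFF.

```python
import numpy as np

def nrff_features(X, k, gamma, seed=0):
    """Normalized random Fourier features (NRFF) for the RBF kernel
    exp(-gamma * ||x - y||^2 / 2), with rows of X assumed unit-norm."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(gamma), size=(d, k))     # w ~ N(0, gamma I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=k)
    Z = np.sqrt(2.0 / k) * np.cos(X @ W + b)             # plain RFF
    # The normalization step: rescale each hashed vector to unit l2 norm.
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)
```

Z @ Z.T then approximates the RBF kernel matrix, and Z can be fed directly to a linear solver such as LIBLINEAR.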
Obviously, the results obtained by GCWS hashing are noticeably better than the results of NRFF hashing, especially when the number of samples (k) is not too large (i.e., the left panels). For the "Letter" dataset, the original dimension is merely 16. It is known that, for modern linear algorithms, the computational cost is largely determined by the number of nonzeros. Hence the number of samples (i.e., k) is a crucial parameter which directly controls the training complexity. From the left panels of Figure 10, we can see that with merely k = 16 samples, GCWS already produces better results than the original linear method. This phenomenon is exciting, because in industrial practice the goal is often to produce better results than linear methods without consuming much more resources.

[Figure 10 image: six accuracy-vs-C panels for Letter, for b = 8, 4, 2; left panels show k = 16, 32, 64, 128 and right panels show k = 128, 256, 1024, 4096.]

Figure 10: Letter: Test classification accuracies of the linearized GMM kernel (solid, GCWS) and the linearized RBF kernel (dashed, NRFF), using LIBLINEAR, averaged over 10 repetitions. In each panel, we report the results on 4 different k (sample size) values: 128, 256, 1024, 4096 (right panels), and 16, 32, 64, 128 (left panels).
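For completeness, here is a sketch of how the GCWS hashed values are turned into the sparse input fed to LIBLINEAR, following the b-bit convention of 0-bit CWS [12] (keeping only the lowest b bits of each i*); the exact encoding used in our experiments may differ in minor details.

```python
import numpy as np

def gcws_to_linear_features(istar, b):
    """Expand k GCWS samples i* into a sparse one-hot vector.

    Each of the k hashed values contributes one nonzero within its own
    block of 2^b coordinates, so every data vector has exactly k
    nonzeros out of k * 2^b dimensions."""
    k = istar.shape[0]
    codes = istar % (2 ** b)                  # keep the lowest b bits
    z = np.zeros(k * (2 ** b), dtype=np.float32)
    z[np.arange(k) * (2 ** b) + codes] = 1.0  # one-hot per hash block
    return z
```

Because only the nonzero positions need to be stored, this representation suits LIBLINEAR's sparse input format well.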
We can see that the linearized RBF (using NRFF) would require substantially more samples in order to reach the same accuracies as the linearized GMM kernel (using GCWS). Two interesting points: (i) Although the original (best-tuned) RBF kernel slightly outperforms the original GMM kernel, the results of GCWS are still more accurate than the results of NRFF even at k = 4096, which is very large, considering that the original data dimension is merely 16. (ii) With merely k = 16 samples (b ≥ 4), GCWS already produces better results than linear SVM based on the original dataset (the solid curve marked by *).

[Figure 11 image: six accuracy-vs-C panels for Webspam20k, for b = 8, 4, 2; left panels show k = 16, 32, 64, 128 and right panels show k = 128, 256, 1024, 4096.]

Figure 11: Webspam20k: Test classification accuracies of the linearized GMM kernel (solid, GCWS) and the linearized RBF kernel (dashed, NRFF), using LIBLINEAR, averaged over 10 repetitions. In each panel, we report the results on 4 different k (sample size) values: 128, 256, 1024, 4096 (right panels), and 16, 32, 64, 128 (left panels). We can see that the linearized RBF (using NRFF) would require substantially more samples in order to reach the same accuracies as the linearized GMM kernel (using GCWS). The linear SVM results are represented by solid curves marked by *.
Figure 11 reports the test classification accuracies on the Webspam20k dataset. Again, the results obtained by GCWS hashing and linear classification are noticeably better than the results of NRFF hashing and linear classification, especially when the number of samples (k) is not too large (i.e., the left panels). For this dataset, the original dimension is 254. With GCWS hashing and merely k = 128, we can achieve higher accuracy than using a linear classifier on the original data. However, with NRFF hashing, we need almost k = 1024 in order to outperform the linear classifier on the original data. Also, note that it is sufficient to use b = 4 for GCWS hashing on this dataset.

Figure 12 and Figure 13 report the test classification accuracies on the DailySports dataset and the RobotNavi dataset, respectively. For both datasets, the original GMM kernel noticeably outperforms the original RBF kernel. Not surprisingly, NRFF hashing requires substantially more samples than GCWS hashing in order to reach similar accuracy, on both datasets. The results also illustrate that the parameter b (i.e., the number of bits we store for each GCWS hashed value i*) does matter; nevertheless, as long as b ≥ 4, the results do not differ much.

Figure 14 and Figure 15 report the test classification accuracies on the SEMG1 dataset and the M-Rotate dataset, respectively. For both datasets, the original RBF kernel considerably outperforms the original GMM kernel. Nevertheless, NRFF hashing still needs substantially more samples than GCWS hashing on both datasets. Again, for GCWS, the results do not differ much once we use b ≥ 4. These results again confirm the advantage of GCWS hashing.

Figure 16 reports the test classification accuracies on more datasets, only for b = 8 and k ≥ 128. Figure 17 presents the hashing results on 6 larger datasets for which we cannot directly train kernel SVMs.
We report results only for b = 8 and k up to 1024. All these results confirm that linearization via GCWS works well for the GMM kernel. In contrast, the normalized random Fourier feature (NRFF) approach typically requires substantially more samples (i.e., much larger k). This phenomenon can be largely explained by the theoretical results in Theorem 3 and Theorem 4, which conclude that GCWS hashing is considerably more accurate than NRFF hashing unless the similarity is high. At high similarity, the variances of both hashing methods become very small.

We should mention that the original (tuning-free) GMM kernel can be modified by introducing tuning parameters, and the original GCWS algorithm can be slightly modified to linearize the new (and tunable) GMM kernel. As shown in [15], on many datasets the tunable GMM kernel can be a strong competitor to computationally expensive algorithms such as deep nets or trees.

[Figure 12 image: eight accuracy-vs-C panels for DailySports, for b = 12, 8, 4, 2; left panels show k = 16, 32, 64, 128 and right panels show k = 128, 256, 1024, 4096.]

Figure 12: DailySports: Test classification
accuracies of the linearized GMM kernel (solid) and the linearized RBF kernel (dashed), using LIBLINEAR. In each panel, we report the results on 4 different k (sample size) values: 128, 256, 1024, 4096 (right panels), and 16, 32, 64, 128 (left panels). We can see that the linearized RBF (using NRFF) would require substantially more samples in order to reach the same accuracies as the linearized GMM kernel (using GCWS).

[Figure 13 image: six accuracy-vs-C panels for RobotNavi, for b = 8, 4, 2; left panels show k = 16, 32, 64, 128 and right panels show k = 128, 256, 1024, 4096.]

Figure 13: RobotNavi: Test classification accuracies of the linearized GMM kernel (solid) and the linearized RBF kernel (dashed), using LIBLINEAR. In each panel, we report the results on 4 different k (sample size) values: 128, 256, 1024, 4096 (right panels), and 16, 32, 64, 128 (left panels). We can see that the linearized RBF (using NRFF) would require substantially more samples in order to reach the same accuracies as the linearized GMM kernel (using GCWS).
[Figure 14 image: eight accuracy-vs-C panels for SEMG1, for b = 12, 8, 4, 2; left panels show k = 16, 32, 64, 128 and right panels show k = 128, 256, 1024, 4096.]

Figure 14: SEMG1: Test classification accuracies of the linearized GMM kernel (solid) and the linearized RBF kernel (dashed), using LIBLINEAR. Again, we can see that the linearized RBF would require substantially more samples in order to reach the same accuracies as the linearized GMM kernel. Note that, for this dataset, the original RBF kernel actually outperforms the original GMM kernel, as shown in Table 1.
[Figure 15 image: eight accuracy-vs-C panels for M-Rotate, for b = 12, 8, 4, 2; left panels show k = 16, 32, 64, 128 and right panels show k = 128, 256, 1024, 4096.]

Figure 15: M-Rotate: Test classification accuracies of the linearized GMM kernel (solid) and the linearized RBF kernel (dashed), using LIBLINEAR. Again, we can see that the linearized RBF would require substantially more samples in order to reach the same accuracies as the linearized GMM kernel. For M-Rotate, the original RBF kernel actually outperforms the original GMM kernel, as shown in Table 3.
[Figure 16 image: eight accuracy-vs-C panels (Gesture, Magic, Musk, Satimage, SpamBase, Splice, M-Noise5, M-Rand), all for b = 8 and k = 128, 256, 1024, 4096.]

Figure 16: More Datasets: Test classification accuracies of the linearized GMM kernel (solid) and the linearized RBF kernel (dashed), using LIBLINEAR. Typically, the linearized RBF would require substantially more samples in order to reach the same accuracies as the linearized GMM kernel.
[Figure 17 image: six accuracy-vs-C panels (IJCNN, PAMAP101Large, PAMAP105Large, SensIT, RCV1, WebspamN1), all for b = 8 and k = 128, 256, 512, 1024.]

Figure 17: Larger Datasets: Test classification accuracies of the linearized GMM kernel with GCWS (solid) and the linearized RBF kernel with NRFF (dashed), using LIBLINEAR, on 6 larger datasets for which we cannot directly compute kernel SVM classifiers. The experiments again confirm that GCWS hashing is substantially more accurate than NRFF hashing at the same sample size k.

Training time: For linear algorithms, the training cost is largely determined by the number of nonzero entries per input data vector. In other words, at the same k, the training times of GCWS and NRFF will be roughly comparable. For GCWS and batch algorithms (such as LIBLINEAR), a larger b will increase the training time, though not by much. See Figure 18 for an example, which actually shows that NRFF consumes more time at high C (for achieving a good accuracy). Note that, with online learning, it would be even more obvious that the training time is determined by the number of nonzeros and the number of epochs. In industrial practice, typically only one epoch or a few epochs are used.
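The per-vector costs can be summarized with back-of-the-envelope arithmetic (illustrative only): at the same sample size k, both methods produce k nonzeros per vector, but GCWS stores only b bits per hashed value while NRFF typically stores a full floating-point value per sample.

```python
def per_vector_costs(k, b, rff_bits=32):
    """Rough per-data-vector costs at sample size k.

    GCWS: k nonzeros in a k * 2**b dimensional one-hot encoding,
    stored as b bits per hashed value.  NRFF: k dense entries,
    stored as rff_bits bits each (e.g., 32-bit floats)."""
    return {
        "gcws_nonzeros": k,
        "nrff_nonzeros": k,
        "gcws_storage_bits": b * k,
        "nrff_storage_bits": rff_bits * k,
    }
```

For example, at k = 1024 and b = 8, GCWS needs 8192 bits (1 KB) per vector versus 32768 bits (4 KB) for 32-bit NRFF.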
[Figure 18 image: three training-time-vs-C panels for PAMAP105Large, for b = 2, 4, 8, each showing k = 128, 256, 512, 1024.]

Figure 18: PAMAP105Large: Training times of GCWS (solid curves) and NRFF (dashed curves), for four sample sizes k ∈ {128, 256, 512, 1024} and b ∈ {2, 4, 8}.

Data storage: For GCWS, the storage cost per data vector is b × k bits, while the cost for NRFF would be k × (number of bits per hashed value), which might be a large value such as 32. Therefore, at the same sample size k, GCWS will likely need less space to store the hashed data than NRFF.

6 Conclusion

Large-scale machine learning has become increasingly important in practice. For industrial applications, it is often the case that only linear methods are affordable. It is thus practically beneficial to have methods which can provide substantially more accurate prediction results than linear methods, with no essential increase of the computational cost. The method of "random Fourier features" (RFF) has been a popular tool for linearizing the radial basis function (RBF) kernel, with numerous applications in machine learning, computer vision, and beyond, e.g., [21, 27, 1, 7, 5, 28, 8, 25, 4, 23]. In this paper, we rigorously prove that a simple normalization step (i.e., NRFF) can substantially improve the original RFF procedure by reducing the estimation variance.

In this paper, we also propose the "generalized min-max (GMM)" kernel as a measure of data similarity, to effectively capture data nonlinearity. The GMM kernel can be linearized via the generalized consistent weighted sampling (GCWS).
Our experimental study demonstrates that GCWS usually does not need too many samples in order to achieve good accuracies. In particular, GCWS typically requires substantially fewer samples to reach the same accuracy as the normalized random Fourier feature (NRFF) method. This is practically important, because the training (and testing) cost and storage cost are determined by the number of nonzeros (which is the number of samples in NRFF or GCWS) per data vector of the dataset. The superb empirical performance of GCWS can be largely explained by our theoretical analysis that the estimation variance of GCWS is typically much smaller than the variance of NRFF (even though NRFF has improved the original RFF).

By incorporating tuning parameters, [15] demonstrated that the performance of the GMM kernel and GCWS hashing can be further improved, in some datasets remarkably so. See [15] for the comparisons with deep nets and trees. Lastly, we should also mention that GCWS can be naturally applied in the context of efficient near neighbor search, due to the discrete nature of the samples, while NRFF or samples from the Nystrom method cannot be directly used for building hash tables.

References

[1] R. H. Affandi, E. Fox, and B. Taskar. Approximate inference in continuous determinantal processes. In NIPS, pages 1430-1438, 2013.
[2] L. Bottou. http://leon.bottou.org/projects/sgd.
[3] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large-Scale Kernel Machines. The MIT Press, Cambridge, MA, 2007.
[4] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In NIPS, pages 1981-1989, 2015.
[5] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. F. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients. In NIPS, pages 3041-3049, 2014.
[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
[7] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, pages 918-926, 2014.
[8] C.-J. Hsieh, S. Si, and I. S. Dhillon. Fast prediction for large-scale kernel machines. In NIPS, pages 3689-3697, 2014.
[9] S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246-255, Sydney, AU, 2010.
[10] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473-480, Corvalis, Oregon, 2007.
[11] P. Li. Robust logitboost and adaptive base class (abc) logitboost. In UAI, 2010.
[12] P. Li. 0-bit consistent weighted sampling. In KDD, Sydney, Australia, 2015.
[13] P. Li. Nystrom method for approximating the gmm kernel. Technical report, arXiv:1605.05721, 2016.
[14] P. Li. Generalized intersection kernel. Technical report, arXiv:1612.09283, 2017.
[15] P. Li. Tunable gmm kernels. Technical report, arXiv:1701.02046, 2017.
[16] P. Li, T. J. Hastie, and K. W. Church. Improving random projections using marginal information. In COLT, pages 635-649, Pittsburgh, PA, 2006.
[17] P. Li, T. J. Hastie, and K. W. Church. Very sparse random projections. In KDD, pages 287-296, Philadelphia, PA, 2006.
[18] P. Li, A. Shrivastava, J. Moore, and A. C. König. Hashing algorithms for large-scale learning. In NIPS, Granada, Spain, 2011.
[19] P. Li and C.-H. Zhang. Theory of the gmm kernel. Technical report, arXiv:1608.00550, 2016.
[20] M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling.
Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[21] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, pages 1509-1517, 2009.
[22] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[23] E. Richard, G. A. Goetz, and E. J. Chichilnisky. Recognizing retinal ganglion cells in the dark. In NIPS, pages 2476-2484, 2015.
[24] W. Rudin. Fourier Analysis on Groups. John Wiley & Sons, New York, NY, 1990.
[25] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In NIPS, pages 3330-3338, 2015.
[26] D. J. Sutherland and J. Schneider. On the error of random fourier features. In UAI, Amsterdam, The Netherlands, 2015.
[27] T. Yang, Y.-f. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random fourier features: A theoretical and empirical comparison. In NIPS, pages 476-484, 2012.
[28] I. E.-H. Yen, T.-W. Lin, S.-D. Lin, P. K. Ravikumar, and I. S. Dhillon. Sparse random feature algorithm as coordinate descent in hilbert space. In NIPS, pages 2456-2464, 2014.