How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets

Zhiyun Lu¹†, Avner May²†, Kuan Liu¹‡, Alireza Bagheri Garakani¹‡, Dong Guo¹‡, Aurélien Bellet⁴‡*, Linxi Fan², Michael Collins², Brian Kingsbury³, Michael Picheny³, Fei Sha¹¶

1 Dept. of Computer Science, U. of Southern California, Los Angeles, CA 90089
{zhiyunlu, kuanl, bagherig, dongguo, feisha}@usc.edu
2 Dept. of Computer Science, Columbia University, New York, New York 10027
{avnermay, mcollins}@cs.columbia.edu, lf2422@columbia.edu
3 IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
{bedk, picheny}@us.ibm.com
4 LTCI UMR 5141, Télécom ParisTech & CNRS, France
aurelien.bellet@telecom-paristech.fr

† and ‡: shared first and second co-authorships, respectively
¶: to whom questions and comments should be sent

August 10, 2016

Abstract

The computational complexity of kernel methods has often been a major barrier for applying them to large-scale learning problems. We argue that this barrier can be effectively overcome. In particular, we develop methods to scale up kernel models to successfully tackle large-scale learning problems that are so far only approachable by deep learning architectures. Based on the seminal work by [38] on approximating kernel functions with features derived from random projections, we advance the state of the art by proposing methods that can efficiently train models with hundreds of millions of parameters, and learn optimal representations from multiple kernels. We conduct extensive empirical studies on problems from image recognition and automatic speech recognition, and show that the performance of our kernel models matches that of well-engineered deep neural nets (DNNs). To the best of our knowledge, this is the first time that a direct comparison between these two methods on large-scale problems is reported.
Our kernel methods have several appealing properties: training with convex optimization, cost for training a single model comparable to DNNs, and significantly reduced total cost due to fewer hyperparameters to tune for model selection. Our contrastive study between these two very different but equally competitive models sheds light on fundamental questions, such as how to learn good representations.

* Most of the work in this paper was carried out while the author was affiliated with the Department of Computer Science, University of Southern California.

1 Introduction

Deep neural networks (DNNs) and other types of deep learning architecture have made significant advances [3, 4]. In both well-benchmarked tasks and real-world applications, such as automatic speech recognition [21, 34, 44] and image recognition [29, 48], deep learning architectures have achieved an unprecedented level of success and have generated major impact. Arguably, the most instrumental factors contributing to their success are: (1) learning from a huge amount of training data for highly complex models with millions to billions of parameters; (2) adopting simple but effective optimization methods such as stochastic gradient descent; (3) combatting overfitting with new schemes such as dropout [23]; and (4) computing with massive parallelism on GPUs. New techniques as well as "tricks of the trade" are frequently invented and added to the toolboxes of machine learning researchers and practitioners.

In stark contrast, there have been far fewer publicly known successful applications of kernel methods (such as support vector machines) to problems at a scale comparable to the speech and image recognition problems tackled by DNNs. This is a surprising chasm, given that kernel methods have been extensively studied both theoretically and empirically for their power in modeling highly nonlinear data [43].
Moreover, the connection between kernel methods and (infinite) neural networks has also long been noted [35, 51, 11]. Nonetheless, a common misconception is that it may be difficult, if not impossible, for kernel methods to catch up with deep learning methods in addressing large-scale learning problems. In particular, many kernel-based algorithms scale quadratically in the number of training samples. This barrier in computational complexity makes it especially challenging for kernel methods to reap the benefits of learning from a very large amount of data, while deep learning architectures are especially adept at it.

We contend that this skepticism can be sufficiently attenuated. Concretely, in this paper, we investigate and propose new ideas tailored for kernel methods, with the aim of scaling them up to take on challenging problems in computer vision and automatic speech recognition. To this end, we build on the work by [38] on approximating kernel functions with features derived from random projections. Our innovation, however, is to advance the state of the art to a much larger scale. Concretely, we propose fast training methods for models with hundreds of millions of parameters; these methods are necessary for classifiers using hundreds of thousands of features to recognize thousands of categories. We also propose scalable methods for combining multiple kernels as ways of learning feature representations. Interestingly, we show that multiplicative combinations of kernels scale better than additive combinations.

We validate our approaches with extensive empirical studies. We contrast kernel models with DNNs on 4 large-scale benchmark datasets, some of which are often used to demonstrate the effectiveness of DNNs.
We show that the performance of large-scale kernel models approaches or surpasses that of their deep learning counterparts, which are either exhaustively optimized by us or well accepted as yardsticks by industry standards.

While providing a recipe for obtaining state-of-the-art large-scale kernel models, another important contribution of our work is to shed light on new perspectives and opportunities for future study. The techniques we have developed are easy to implement, readily reproducible, and incur much less computational cost (for hyperparameter tuning and model selection) than deep learning architectures. Thus, they are valuable tools, tested and verified to be effective for constructing comparative systems. Comparative studies enabled by such systems will, in our view, be indispensable in pursuing the higher goal of exploring and acquiring an understanding of how the two camps of methods differ, for instance in learning new representations of the original data.¹ As an example, we show that combining kernel models and DNNs improves over either individual model, suggesting that the two paradigms learn different yet complementary representations from the data. We believe that research in this line will offer deep insights, and broaden the theory and practice of designing alternative methods to both DNNs and kernel methods for large-scale learning.

¹ Note this inquiry would be informative only if both kernel methods and deep learning methods attain similar performance yet exploit different aspects of data.

The rest of the paper is organized as follows. We briefly review related work in section 2. In section 3, we give a brief account of [38]. We describe our approaches in section 4. In section 5, we report extensive experiments comparing DNNs and kernel methods on problems in image and automatic speech recognition. We conclude and discuss future directions in section 6.
2 Related work

The computational complexity of kernel methods, such as support vector machines, depends quadratically on the number of training examples at training time and linearly on the number of training examples at testing time. Hence, scaling up kernel methods has been a long-standing and actively studied problem. [8] summarizes several earlier efforts in this vein. With clever implementation tricks such as computation caching (for example, keeping only a small portion of the very large kernel matrix in memory), earlier kernel machines could cope with hundreds of thousands of samples [46, 17]. [7] provides an excellent account of various design considerations.

To further reduce the dependency on the number of training samples, a more effective strategy is to actively select training samples [6]. An early version of this idea was reflected in the Sequential Minimal Optimization (SMO) algorithm [37]. With more sophistication, this technique was extended to enable training SVMs on 8 million samples [33]. Alternative approaches exploit the equivalence between SVMs and sparse greedy approximation and solve SVMs approximately with a smaller subset of examples called coresets² [49, 12]. Exploiting structures of the kernel matrix can scale kernel methods to 2 million to 50 million training samples [47]. Note that at the time of publication, none of the above-mentioned methods had been directly compared to DNNs.

Instead of reducing the number of training samples, we can reduce the dimensionality of kernel features. In theory, those features are infinite-dimensional. But for any practical problem, the dimensionality is bounded above by the number of training samples. The main idea is then to directly use such features, after dimensionality reduction, to construct classifiers (i.e., solving the optimization problem of SVM in the primal space).
Thus far, approximating kernels with finite-dimensional features has been recognized as a promising way of scaling up kernel methods. The most relevant to our paper is the early observation by Rahimi and Recht that inner products between features derived from random projections can be used to approximate translation-invariant kernels, a direct result of spectral analysis of positive functions [5, 43, 38]. Their follow-up work using those random features, weighted random kitchen sinks [39], is a major inspiration for our work. Since then, there has been growing interest in using random projections to approximate different kernels [25, 19, 31, 50]. For example, [15] studied how to use random features for online learning. We note that the amount of time for such classifiers to make a prediction depends linearly on the number of training samples. This can be a concern when the number of training samples is large.

Despite this progress, there have been only a few reported large-scale empirical studies of those techniques on challenging tasks from speech recognition and computer vision, on which DNNs have been highly effective. In the context of automatic speech recognition, examples of directly using kernel methods have been reported [18, 9, 24]. However, the tasks were fairly small-scale (for instance, on the TIMIT dataset). Moreover, none of them explores kernel learning as a way of learning new representations. In contrast, one major aspect of our work is to use multiple kernel learning to arrive at new representations so as to reduce the gap between DNNs and kernel methods; cf. section 4.2.

3 Features from random projections

In what follows, we describe the basic idea we have built upon to scale up kernel methods.
The technique is based on explicitly and efficiently constructing features, generated randomly, whose inner products then approximate kernel functions. Once such features are constructed, they can be used as inputs by any classifier.

² We also experimented with those techniques, though we were not able to identify significant empirical success.

3.1 Generating features by random projections

Given a pair of data points x and z, a positive definite kernel function k(·, ·): R^d × R^d → R defines an inner product between the images of the two data points under a (nonlinear) mapping φ(·): R^d → R^M,

    k(x, z) = φ(x)^T φ(z)    (1)

where the dimensionality M of the resulting mapping φ(x) can be infinite (in theory). Kernel methods avoid inference in R^M. Instead, they rely on the kernel matrix over the training samples. When M is far greater than N, the number of training samples, this trick provides a nice computational advantage. However, when N is exceedingly large, this complexity, quadratic in N, becomes impractical.

Rahimi and Recht leverage a classical result in harmonic analysis and provide a fast way to approximate k(·, ·) with finite-dimensional features [38]:

Theorem 1. (Bochner's theorem, adapted from [38]) A continuous kernel k(x, z) = k(x − z) is positive definite if and only if k(δ) is the Fourier transform of a non-negative measure.

More specifically, for shift-invariant kernels such as the Gaussian RBF and Laplacian kernels,

    k_rbf(x, z) = exp(−‖x − z‖₂² / 2σ²),    k_lap(x, z) = exp(−‖x − z‖₁ / σ)    (2)

the theorem implies that the kernel function can be expanded with harmonic basis functions, namely

    k(x − z) = ∫_{R^d} p(ω) e^{jω^T(x−z)} dω = E_ω[ e^{jω^T x} e^{−jω^T z} ]    (3)

where p(ω) is the density of a d-dimensional probability distribution. The expectation is computed on complex-valued functions of x and z.
For real-valued kernel functions, however, the expansion can be simplified to cosine and sine functions, as shown below. For the Gaussian RBF and Laplacian kernels, the corresponding densities are the Gaussian and Cauchy distributions:

    p_rbf(ω) = N(0, σ^{−2} I),    p_lap(ω) = ∏_d σ / (π(1 + σ²ω_d²))    (4)

The harmonic decomposition suggests a sampling-based approach to approximating the kernel function. Concretely, we draw {ω₁, ω₂, ..., ω_D} from the distribution p(ω) and use the sample mean to approximate

    k(x, z) ≈ (1/D) Σ_{i=1}^{D} φ_{ω_i}(x) φ_{ω_i}(z) = φ̂(x)^T φ̂(z)    (5)

The random feature vector φ̂ is thus composed of scaled cosines of random projections

    φ_{ω_i}(x) = √(2/D) cos(ω_i^T x + b_i)    (6)

where b_i is a random variable sampled uniformly from [0, 2π]. Details on the convergence properties of this approximation can be found in [38].

A key advantage of using approximate features over standard kernel methods is scalability to large datasets. Learning with a representation φ̂(·) ∈ R^D is relatively efficient provided that D is far less than the number of training samples. For example, in our experiments (cf. section 5), we have 7 million to 10 million training samples, while D = 50,000 often leads to good performance.

3.2 Using random features in classifiers

Just as standard kernel methods (SVMs or kernelized linear regression) can be seen as fitting data with linear models in kernel-induced feature spaces, we can plug the random feature vector φ̂(x) into just about any (linear) model. In this paper, we focus on using it to construct multinomial logistic regression models. Specifically, our model is a special instance of the weighted sum of random kitchen sinks [39]:

    p(y = c | x) = exp(w_c^T φ̂(x)) / Σ_{c'} exp(w_{c'}^T φ̂(x))    (7)

where the label y can take any value from {1, 2, ..., C}.
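As a concrete illustration, the construction of eqs. (5) and (6) can be checked numerically in a few lines. The sketch below is our own minimal example, not code from the paper: it draws ω from the Gaussian RBF density p_rbf(ω) and verifies that the inner product of the random features concentrates around the exact kernel value.

```python
import numpy as np

def gaussian_rff(X, D, sigma, rng):
    """Random Fourier features for the Gaussian RBF kernel
    k(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
    Draws omega ~ N(0, sigma^{-2} I) and b ~ Uniform[0, 2*pi],
    then maps x -> sqrt(2/D) * cos(omega^T x + b), as in eq. (6)."""
    d = X.shape[1]
    Omega = rng.normal(0.0, 1.0 / sigma, size=(d, D))  # samples from p_rbf(omega)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ Omega + b)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
z = rng.normal(size=5)
sigma = 2.0

exact = np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
Phi = gaussian_rff(np.vstack([x, z]), D=200_000, sigma=sigma, rng=rng)
approx = float(Phi[0] @ Phi[1])  # inner product approximates k(x, z), eq. (5)
```

With D = 200,000 projections the Monte Carlo error is on the order of 1/√D, so `approx` agrees with `exact` to a few decimal places.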
We use multinomial logistic regression mainly because it can deal with a large number of classes and provides posterior probability assignments, needed by the application task (i.e., the speech recognition systems, in order to combine with components such as language models).

4 Our Approaches

To scale up kernel methods, we address two challenges: (1) how to train large-scale models of the form of eq. (7); (2) how to learn optimal kernel functions adapted to the data. We tackle the former with a parallel optimization algorithm and the latter by extending the construction of random features initially proposed in [38].

4.1 Parallel optimization for large-scale kernel models

While random features and the weighted sum of random kitchen sinks have been investigated before, there are few reported cases of scaling up to the problems commonly seen in automatic speech recognition and other domains. For example, in our empirical studies of acoustic modeling (cf. section 5.3), the number of classes is C = 1000 and we often use more than D = 100,000 random features to compose φ̂(x). Thus, the linear model of eq. (7) has a large number of parameters (about C × D = 10^8).

We have developed two major strategies to overcome this challenge. First, we leverage the observation that fitting multinomial logistic regression is a convex optimization problem and adopt the method of stochastic averaged gradient (SAG) for its faster convergence, both theoretical and empirical, over stochastic gradient descent (SGD) [42]. Note that while SGD is widely applicable to both convex and non-convex optimization problems, SAG is specifically designed for convex optimization and is thus well suited to our learning setting.

Secondly, we leverage the property that random projections are just random: given a D-dimensional φ̂(x), any random subset of it is still random.
Our idea is then to train a model on each subset of features in parallel and then assemble them together to form a large model. Specifically, for large D (say ≥ 100,000), we partition the features into B blocks φ̂_b(x), each of size D₀ (say 25,000). Note that each block corresponds to a different set of random projections sampled from the density p(ω). We train B multinomial logistic regression models and obtain B sets of parameters for each class, i.e., {w_c^b : c = 1, 2, ..., C}, b = 1, ..., B. To assemble them, we combine in the spirit of the geometric mean of the probabilities (or the arithmetic mean of the log probabilities):

    p(y = c | x) ∝ exp( (1/B) Σ_{b=1}^{B} φ̂_b(x)^T w_c^b ) = ( ∏_b exp(φ̂_b(x)^T w_c^b) )^{1/B}    (8)

Note that this assembled model can be seen as a D-dimensional model with parameters {(1/B) w_c^b : c = 1, 2, ..., C}, b = 1, ..., B.

We sketch the main argument for the validity of this parallel training procedure, leaving a rigorous proof for future work. The parameters of the weighted sum of random kitchen sinks converge in O(1/√D) to the true risk minimizer [39]. Thus, for each model of size D₀, the pre-softmax activations (i.e., the logits) converge in O(1/√D₀). For B such models, the arithmetic mean of the logits converges in O(1/(√B √D₀)), thus matching the rate of a D-dimensional model. Our extensive empirical studies support this argument: in virtually all training settings, the assembled models cannot be improved further, attaining the optimum of the corresponding D-dimensional model.

4.2 Learning kernel features

Another advantage of using kernels is sidestepping the problem of feature engineering, i.e., how to select the best basis functions for the task at hand. Essentially, determining which kernel function to use implicitly specifies the basis functions. But then the question becomes: how to select the best kernel function?
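Before turning to kernel learning, the assembly step of eq. (8) in section 4.1 can be sanity-checked in a few lines. In this sketch (our own, with random matrices standing in for the B trained per-block parameter sets), averaging per-block logits is verified to coincide with a single D-dimensional model whose parameters are the stacked blocks scaled by 1/B.

```python
import numpy as np

rng = np.random.default_rng(1)
B, D0, C, n = 4, 25, 3, 10              # blocks, block size, classes, samples
Phi = rng.normal(size=(n, B * D0))      # full random-feature matrix for n samples
blocks = np.split(Phi, B, axis=1)       # B disjoint feature blocks
W = [rng.normal(size=(D0, C)) for _ in range(B)]  # stand-ins for trained w_c^b

# Assembled model: arithmetic mean of per-block logits, as in eq. (8)
logits_avg = sum(Pb @ Wb for Pb, Wb in zip(blocks, W)) / B

# Equivalent single D-dimensional model with parameters (1/B) w_c^b stacked
W_full = np.vstack(W) / B
logits_full = Phi @ W_full

def softmax(L):
    e = np.exp(L - L.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

p = softmax(logits_avg)  # geometric mean of per-block class posteriors
```

The equivalence is exact by linearity, so in practice only the per-block models need to be trained and stored.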
One popular paradigm for addressing the latter problem is multiple kernel learning (MKL) [30, 1, 13, 27]. Starting from a collection of base kernels, the algorithm identifies the best subset of them and combines them to best adapt to the training data, analogous to designing the best features according to the data.

In the following, we show how a few common MKL ideas can benefit from the previously described large-scale learning techniques (cf. section 3). While many MKL algorithms are formulated with kernel matrices (and thus are not easily scalable to large problems), we demonstrate how they can be efficiently implemented with the general recipe of random feature approximation. Among them, we show an interesting and novel result on combining kernels with Hadamard products, where the random feature approximation is especially computationally advantageous. In our empirical studies (detailed in the Supplementary Material), we show that MKL improves on methods using a single kernel, and eventually approaches the performance of deep neural networks. Thus, MKL presents an effective and computationally tractable alternative to DNNs, even for large-scale problems.

Additive Kernel Combination  Given a collection of base kernels {k_i(·, ·), i = 1, 2, ..., L}, their non-negative combination

    k(·, ·) = Σ_i α_i k_i(·, ·)    (9)

is also a kernel function, provided α_i ≥ 0 for all i. Suppose each kernel k_i(·, ·) is approximated with a D-dimensional random feature vector φ̂_i(·), as in eq. (5). Then, given the linearity of the combination, the kernel function k(·, ·) can be approximated by

    k(·, ·) ≈ Σ_i α_i φ̂_i(·)^T φ̂_i(·) = φ̂(·)^T φ̂(·)    (10)

where φ̂(·) is simply the concatenation of the √α_i-scaled φ̂_i(·). Note that the dimensionality of φ̂(·) would be L × D. There are several ways to exploit this approximation.
The first way is to plug φ̂(·) directly into the multinomial logistic regression of eq. (7) and optimize over the L × D features. The second way is more scalable. For each φ̂_i(·), we learn an optimal model with parameters w_c^i for each class c. We then learn a set of combination coefficients α_i by optimizing the likelihood model

    p(y = c | x) ∝ exp( Σ_i √α_i φ̂_i(x)^T w_c^i )    (11)

while holding the other parameters fixed. This is a convex optimization with a (presumably) small set of parameters. While the first approach is more general, empirically we do not observe a strong difference, and we have adopted the second approach for its scalability.

Multiplicative Kernel Combination  Kernels can also be multiplicatively combined from base kernels:

    k(·, ·) = ∏_{i=1}^{L} k_i(·, ·)    (12)

Note that this is a highly nonlinear combination [13]. Unlike the additive combination, for the multiplicative combination there is no simple way (such as concatenation) to compose the approximate features of the individual kernels. Nonetheless, we have proved the following theorem, which gives a way to construct approximate features for k(·, ·) efficiently.

Theorem 2.
Suppose all k_i(·, ·) are translation-invariant kernels such that

    k_i(x − z) = ∫_{R^d} p_i(ω) e^{jω^T(x−z)} dω    (13)

Then k(·, ·) is also translation-invariant, with

    k(x − z) = ∫_{R^d} p(ω) e^{jω^T(x−z)} dω    (14)

where the probability measure p(ω) is given by the convolution of all the p_i(ω):

    p(ω) = p₁(ω) ∗ p₂(ω) ∗ ··· ∗ p_L(ω)    (15)

Moreover, let ω_i ∼ p_i(ω) be a random variable drawn from the corresponding distribution; then

    ω = Σ_i ω_i ∼ p(ω)    (16)

Namely, to approximate k(·, ·), one needs only to draw random variables from each individual component kernel's corresponding density, and use the sum of those variables to compute random features.

The proof of the theorem is in the Supplementary Material. We note that ω and the ω_i have the same dimensionality. Thus, the number of approximating features is independent of the number of kernels, leading to a computational advantage over the additive combination.

Kernel composition  Kernels can also be composed. Specifically, if k₂(x, z) is a kernel function that depends only on the inner products of its arguments, then k = k₂ ∘ k₁ is also a kernel function. A concrete example is when k₂ is the Gaussian RBF kernel and k₁(x, z) = φ₁(x)^T φ₁(z) for some mapping φ₁(·):

    k(x, z) = exp(−‖φ₁(x) − φ₁(z)‖₂² / σ²) = exp(−[k₁(x, x) + k₁(z, z) − 2k₁(x, z)] / σ²)

If we approximate both k₁ and k₂ using the random feature approximation of eq. (5), the composition is (graphically) equivalent to the following mapping,

    x  →(ω ∼ p₁(ω))→  φ̂₁(x)  →(ω ∼ p₂(ω))→  φ̂₂(φ̂₁(x))    (17)

namely, a one-hidden-layer neural network with the weight parameters in both layers being completely random. As before, the result of the composite mapping φ̂₂ ∘ φ̂₁ can be used in any classifier as input features.
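Theorem 2 is easy to verify numerically for the product of two Gaussian RBF kernels, a case where the convolution of the two Gaussian densities is again Gaussian (variances add). The sketch below is our own check, not code from the paper: it draws ω = ω₁ + ω₂ as in eq. (16) and confirms that the resulting random features approximate the product kernel.

```python
import numpy as np

rng = np.random.default_rng(2)
d, D = 5, 200_000
s1, s2 = 1.5, 2.5                       # bandwidths of the two RBF base kernels
x, z = rng.normal(size=d), rng.normal(size=d)

sq = np.sum((x - z) ** 2)
exact = np.exp(-sq / (2 * s1 ** 2)) * np.exp(-sq / (2 * s2 ** 2))  # k1 * k2

# Eq. (16): omega = omega_1 + omega_2 with omega_i ~ p_i(omega); the sum is
# distributed as the convolution p_1 * p_2, here N(0, (s1^-2 + s2^-2) I)
Omega = rng.normal(0, 1 / s1, size=(d, D)) + rng.normal(0, 1 / s2, size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)
phi = lambda v: np.sqrt(2.0 / D) * np.cos(v @ Omega + b)
approx = float(phi(x) @ phi(z))         # approximates the product kernel
```

Note that the feature dimension stays D no matter how many base kernels are multiplied, which is the computational advantage over the additive combination.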
Table 1: Handwritten digit recognition error rates (%)

Model                    kernel         DNN
Augmented training data  no     yes     no     yes
On validation            0.97   0.77    0.71   0.62
On test                  1.09   0.85    0.69   0.77

We also generalize this operation by introducing a linear projection P to reduce dimensionality, serving as an information bottleneck: φ̂₂ ∘ P ∘ φ̂₁. We experimented with two choices. First, P performs PCA (using the sample covariance matrix) on φ̂₁(x). Note that this implies P ∘ φ̂₁ is an approximate kernel PCA on the original feature space x, using the kernel k₁. Secondly, P performs supervised dimensionality reduction. One simple choice is to implement Fisher discriminant analysis (FDA) on φ̂₁(x), which is equivalent to kernel FDA on x. In our experiments, we have used a different procedure in a similar spirit: we first use φ̂₁(x) as input features to build a multinomial logistic regression model to predict the labels, and then perform PCA on the log-posterior probabilities. Our choice here is largely motivated by reuse of computation: as we often need to estimate the performance of k₁(·, ·) alone, the multinomial classifier built with k₁(·, ·) is readily available.

5 Experimental Results

We validate our approaches to scaling up kernel methods on challenging problems in computer vision and automatic speech recognition (ASR). We conduct extensive empirical studies comparing kernel methods to deep neural networks (DNNs), which perform well in computer vision and are state-of-the-art in ASR. We show that kernel methods attain performance similarly competitive with DNNs; details are in sections 5.2 and 5.3 (as well as the Supplementary Material). What can we learn from two very different, yet equally competitive, learning models? We report our initial findings on this question (section 5.5).
We show that kernel methods and DNNs learn different yet complementary representations of the data. As such, a direct application of this observation is to combine them, obtaining better models than either independently.

5.1 General setup

For all kernel-based models, we tune only three hyperparameters: the bandwidth of the Gaussian or Laplacian kernel, the number of random projections, and the step size of the (convex) optimization procedure (adjusting it has an effect similar to early stopping). For all DNNs, we tune hyperparameters related to both the architecture and the optimization. This includes the number of layers, the number of hidden units in each layer, the learning rate, the rate decay, the momentum, regularization, etc. We also use unsupervised pre-training and tune hyperparameters for that phase too. Details about tuning those hyperparameters are described in the Supplementary Material, as they are often dataset-specific. In short, model selection for kernel models has significantly lower computational cost. We give concrete measures to support this observation in section 5.4.

5.2 Computer vision tasks

We experiment on two problems: handwritten digit recognition and object recognition.

Handwritten digit recognition  We extract a dataset, MNIST-6.7M, from the dataset MNIST-8M [33]. MNIST-8M is a transformed version of the original MNIST dataset [32]. Concretely, we randomly select 50,000 out of the 60,000 images in MNIST's training set and extract the corresponding samples (the original as well as transformed/distorted ones) from MNIST-8M, resulting in 6.75 million samples in total, as our training set.

Table 2: Object recognition error rates (%)

Model                    kernel         DNN
Augmented training data  no     yes     no     yes
On validation            43.2   41.4    42.9   43.2
On test                  43.9   42.2    43.3   44.0
We use the remaining 10,000 images from the original training set as a validation set; we purposely avoid using any transformed versions of those 10,000 images for validation, to avoid potential overfitting. We report the test error rate on the standard 10,000-image MNIST test set.

We also experimented with a data augmentation trick to increase the number of training samples during training: whenever we encounter a training sample, we corrupt it with masking noise (randomly flipping 1 to 0 in the binary image). We crudely tune the mask-out rate over 0.1, 0.2, and 0.3.

Table 1 compares the performance of the best single-kernel classifier to that of the best DNN. The kernel classifier uses a Gaussian kernel and 150,000 random projections. The best DNN has 4 hidden layers with 1000, 2000, 2000, and 2000 hidden units, respectively. The difference between the kernel model and the DNN is small: about 16 (out of 10,000) misclassified images. Interestingly, on the test data, the kernel model benefits from the data augmentation trick while the DNN does not. Possibly, the DNN overfits to the validation dataset.

Object recognition  For this task, we experiment on the CIFAR-10 database [28]. The dataset contains 50,000 training samples and 10,000 test samples. Each sample is an RGB image of 32 × 32 pixels in one of 10 object categories. We randomly picked 10,000 images from the training set for validation, keeping the remaining 40,000 images for training. We did not perform any preprocessing of the images, as we want to relate our findings to previously published results, which often do not preprocess data or do not report all specifics of their preprocessing. We also experimented with an augmented version of the dataset, injecting Gaussian noise into the images during training.

Table 2 compares the performance of the best single-kernel classifier to that of the best DNN.
The kernel classifier uses a Gaussian kernel and 4,000,000 random projections. The best DNN has 3 hidden layers with 4000, 2000, and 2000 hidden units, respectively. The best kernel model performs slightly better than the DNN. They both outperform previously reported DNN results on this dataset, whose error rates are between 44.4% and 48.5% [41, 28, 40]. Convolutional neural nets (CNNs) can significantly outperform DNNs on this dataset. However, we do not compare to CNNs, as our kernel models (as well as the DNNs) do not construct feature extractors with prior knowledge, while CNNs are designed especially for object recognition.

5.3 Automatic speech recognition (ASR)

Task and evaluation metric  Deep neural nets (DNNs) have been applied very successfully to ASR, where they perform the task of acoustic modeling. Acoustic modeling is analogous to conventional multi-class classification, that is, learning a predictive model to assign context-dependent phoneme state labels y to short segments of speech, called frames, represented as acoustic feature vectors x. Acoustic feature vectors are extracted from frames and their context windows (i.e., neighboring frames in temporal proximity). Analogously, the kernel-based multinomial logistic regression models described in section 3.2 are also used as acoustic models and compared to DNNs.

Acoustic models are often evaluated in conjunction with other components of ASR systems. In particular, speech recognition is inherently a sequence recognition problem. Thus, perplexity and classification accuracies, commonly used for conventional multi-class classification problems, provide only a proxy (and intermediate goals) for the sequence recognition error.
To measure the latter, a full ASR pipeline is necessary, in which the posterior probabilities of the phoneme states are combined with the probabilities of the language models (over the linguistic units of interest, such as words) to yield the most probable sequence of those units. A best alignment with the ground-truth sequence is then computed, yielding token error rates (TER).

Table 3: Best Token Error Rates on Test Set (%)

Model               Bengali   Cantonese
ibm                 70.4      67.3
rbm                 69.5      66.3
best kernel model   70.0      65.7

Given the inherent complexity, in what follows we summarize the empirical studies of applying both paradigms to the acoustic modeling task. We report TER on two different languages. Details are presented in the Supplementary Material, including comparisons in terms of both perplexity and accuracy for the different models. We begin by describing the datasets, followed by a brief description of the various kernel and DNN models we have experimented with.

Datasets   We use two datasets: the IARPA Babel Program Cantonese (IARPA-babel101-v0.4c) and Bengali (IARPA-babel103b-v0.4b) limited language packs. Each pack contains a 20-hour training set and a 20-hour test set. We designate about 10% of the training data as a held-out set for model selection and tuning. We follow the same procedure to extract acoustic features from raw audio as in previous work using DNNs for ASR [26]. In particular, we used IBM's proprietary system Attila, adapted for the above-mentioned Babel language packs. The acoustic features are 360-dimensional real-valued dense vectors, and there are 1000 (non-overlapping) context-dependent phoneme state labels for each language pack. For Cantonese, there are about 7.5 million data points for training, 0.9 million for held-out, and 7.2 million for test; for Bengali, 7.7 million for training, 1.0 million for held-out, and 7.1 million for test.
For Bengali, the TER metric is the word error rate (WER); for Cantonese, it is the character error rate (CER).

Models experimented with   IBM's Attila ASR system has a DNN acoustic model with five hidden layers, each containing 1,024 units with logistic nonlinearities. We refer to this system as ibm. We developed another version of DNN, following the original Restricted Boltzmann Machine (RBM)-based training procedure for learning DNNs [22]; specifically, the pre-training is unsupervised. We trained DNNs with 1, 2, 3, and 4 hidden layers, and 500, 1000, and 2000 hidden units per layer (thus, 12 architectures per language). We refer to this system as rbm.

For kernel-based acoustic models, we used Gaussian RBF kernels, Laplacian kernels, or combinations of them. The only kernel hyperparameter to tune is the bandwidth, which ranges from 0.3 to 5 times the median of the pairwise distances in the data (typically, the median itself works well); the number of random projections ranges from 2,000 to 400,000 (though performance is often stable at 25,000 or above). For training with very large numbers of features, we used the parallel training procedure described in section 4.1. For optimization, we used the stochastic average gradient method and tuned the step size loosely over the 4 values {10^-4, 10^-3, 10^-2, 10^-1}. Details about these systems are in the Supplementary Material.

Results   Table 3 reports the best-performing models measured in TER. The RBM-trained DNN (rbm), which has 4 hidden layers and 2000 units per layer, performs best on Bengali. But our best kernel model, which uses a Gaussian RBF kernel and 150,000 to 200,000 random projections, performs best on Cantonese. Both perform better than IBM's DNN. On Cantonese, the improvement of the kernel model over ibm is substantial (a 1.6% absolute reduction).
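The random projections underlying all of these kernel models follow Rahimi and Recht [38]. As a sketch of how such features are generated for the Gaussian RBF kernel (the sizes and bandwidth below are illustrative, not the paper's settings):

```python
import numpy as np

def random_fourier_features(X, D, bandwidth, rng):
    """Approximate k(x, z) = exp(-||x - z||^2 / (2 * bandwidth^2))
    by phi(x)^T phi(z), where phi(x) = sqrt(2/D) cos(Omega^T x + b)
    and Omega is drawn from the kernel's spectral density,
    N(0, bandwidth^-2 I)  [38]."""
    d = X.shape[1]
    Omega = rng.normal(scale=1.0 / bandwidth, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ Omega + b)

# Sanity check: feature inner products approach the exact kernel as D grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
s = 2.0
Phi = random_fourier_features(X, D=50000, bandwidth=s, rng=rng)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / (2 * s ** 2))
```

A multinomial logistic regression trained on Phi then plays the role of the kernel acoustic model; the paper's models differ only in scale (up to 400,000 features) and in the optimizer used.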
Table 4: Combining the best-performing kernel and DNN models

Dataset       MNIST-6.7M   CIFAR-10   Bengali   Cantonese
Best single   0.69         42.2       69.5      65.7
Combined      0.61         40.3       69.1      64.9

5.4 Computational efficiency

In contrast to DNNs, kernel models can be developed more efficiently. We illustrate this in two respects: the computational cost of training a single model, and the cost of model selection (i.e., hyperparameter tuning).

Cost of training a single model   The amount of training time depends on several factors, including the volume and dimensionality of the dataset, the choice of hyperparameters and their effect on convergence, implementation details, etc. We give a rough picture after controlling for those extraneous factors as much as possible. We implemented both methods with highly optimized Matlab code (comparable to our CUDA C implementation) and used a single GPU (NVIDIA Tesla K20m). The timing results reported below are from training acoustic models on the Bengali dataset. For a kernel model with 25,000 random projections (25 million model parameters), convergence is reached in fewer than 20 epochs, with an average of 15 minutes per epoch. In contrast, a competitive deep model with four hidden layers of 2,000 hidden units (15 million parameters), if initialized with pretrained parameters, reaches convergence in roughly 10 epochs, with an average of 28 minutes per epoch (the pretraining requires an additional 12 hours). Thus, the training time for a single kernel model is about the same as that for a DNN. This holds for a range of datasets and hyperparameter configurations.

Cost of model selection   The number of kernel models to be tuned is smaller than the number of DNNs by at least one order of magnitude. There are only two hyperparameters to search over when selecting kernel models: the kernel bandwidth and the learning rate.
Generally, the higher the number of random projections, the better the performance. For DNNs, substantially more hyperparameters need to be tuned. As previously mentioned, in our experiments we tuned those related to the network architecture and the optimization procedure, for both pre-training and fine-tuning. As such, it is fairly common to select the best DNN among hundreds to thousands of candidates.

Combining both factors, kernel models are especially appealing for tackling new problem settings where there is only weak knowledge about the optimal hyperparameters or their proper ranges. To develop DNNs in this scenario, one would be forced to combinatorially adjust many knobs, while kernel approaches remain simple and straightforward.

5.5 Do kernel and deep models learn the same thing?

Given their matching performance, do kernel and DNN models learn the same knowledge from data? We report our initial findings below. We first combine the best-performing models from each paradigm, using a weighted sum of pre-softmax activations (i.e., logits). Table 4 summarizes the results across the 4 tasks studied in the previous sections. In the row "Best single", blue indicates the number comes from a kernel model and red from a DNN. Clearly, combining the two improves over either one independently. These improvements suggest that, despite being close in error rates, the two models are still different enough.

We gain more intuitive understanding by visualizing what is learned by each model. To this end, we project each model's pre-softmax activations onto the 2-D plane with t-SNE.

Figure 1: t-SNE embeddings of the data representations learned by the kernel model (left) and by the DNN (right).
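The combination scheme used above is just a weighted sum of the two models' logits before the softmax. A minimal sketch (the convex weighting by a single alpha, which would be tuned on held-out data, is our assumption; the paper does not state the exact weighting):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def combine_logits(logits_kernel, logits_dnn, alpha):
    """Weighted sum of the two models' pre-softmax activations,
    turned into class posteriors by a single softmax."""
    return softmax(alpha * logits_kernel + (1.0 - alpha) * logits_dnn)

rng = np.random.default_rng(1)
lk = rng.normal(size=(5, 10))   # toy kernel-model logits
ld = rng.normal(size=(5, 10))   # toy DNN logits
p = combine_logits(lk, ld, alpha=0.5)
```

Setting alpha to 1 or 0 recovers the individual models, so the combined system can never be tuned to do worse on held-out data than the better single model.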
The relative positioning of those well-separated clusters differs, suggesting that the two models learn to represent the data in quite different ways. (The kernel embedding is computed using the DNN's embedding as initialization, to avoid randomness in t-SNE.) Fig. 1 contrasts the embeddings for 1000 samples from the MNIST-6.7M test set. Each data point is a dot, and the color encodes its label. Given the low classification error rates, it is not surprising to find 10 well-separated clusters, one for each of the 10 digits. However, the relative positioning of those clusters is noticeably different between the DNN and the kernel model. It is not clear that the embeddings can be transformed into each other by linear transformations; this seems to suggest that each method has its own way of nonlinearly embedding the data. Elucidating this more precisely is a direction for future research.

6 Conclusion

We propose techniques to scale up kernel methods to large learning problems of the kind commonly found in speech recognition and computer vision. We have shown that the performance of these large kernel models approaches or surpasses that of their deep neural network counterparts, which have been regarded as the state of the art. Future directions of our research include understanding the differences between these two camps of methods, for instance, in learning new representations of data.

Acknowledgement

F. S. is grateful to Lawrence K. Saul (UCSD), Léon Bottou (Microsoft Research), Alex Smola (CMU), and Chris J. C. Burges (Microsoft Research) for many fruitful discussions and pointers to relevant work. Computation for the work described in this paper was partially supported by the University of Southern California's Center for High-Performance Computing (http://hpc.usc.edu). This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S.
Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government. Additionally, A. B. G. is partially supported by a USC Provost Graduate Fellowship. F. S. is partially supported by NSF IIS-1065243, a Google Research Award, an Alfred P. Sloan Research Fellowship, and an ARO YIP Award (W911NF-12-1-0241).

References

[1] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. of the 21st Intl. Conf. on Mach. Learn. (ICML), 2004.
[2] Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors. Advances in Neural Information Processing Systems 22, 2009.
[3] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, January 2009.
[4] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: a review and new perspectives. IEEE Trans. on Pattern Anal. & Mach. Intell., 35(8):1798–1828, 2013.
[5] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer, 1984.
[6] Léon Bottou. Personal communication, 2014.
[7] Léon Bottou and Chih-Jen Lin. Support vector machine solvers. In Bottou et al. [8].
[8] Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors. Large Scale Kernel Machines. MIT Press, Cambridge, MA, 2007.
[9] Chih-Chieh Cheng and B. Kingsbury. Arccosine kernels: Acoustic modeling with infinite neural networks.
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5200–5203, 2011.
[10] K. Cho, A. Ilin, and T. Raiko. Improved learning of Gaussian-Bernoulli restricted Boltzmann machines. In Proceedings of the International Conference on Artificial Neural Networks (ICANN 2011), pages 10–17, 2011.
[11] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Bengio et al. [2], pages 342–350.
[12] Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms, 6(4):63:1–63:30, 2010.
[13] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In Bengio et al. [2], pages 396–404.
[14] Corinna Cortes, Neil Lawrence, and Kilian Weinberger, editors. Advances in Neural Information Processing Systems 27, 2014.
[15] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Cortes et al. [14].
[16] Sanjoy Dasgupta and David McAllester, editors. Proc. of the 30th Intl. Conf. on Mach. Learn. (ICML), volume 28 of JMLR W&CP, 2013.
[17] Dennis DeCoste and Bernhard Schölkopf. Training invariant support vector machines. Mach. Learn., 46:161–190, 2002.
[18] Li Deng, Gökhan Tür, Xiaodong He, and Dilek Z. Hakkani-Tür. Use of kernel deep convex networks and end-to-end learning for spoken language understanding. In 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, FL, USA, December 2-5, 2012, pages 210–215, 2012.
[19] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In Dasgupta and McAllester [16], pages 19–27.
[20] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[21] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.
[22] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Comp., 18(7):1527–1554, 2006.
[23] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, July 2012. URL http://arxiv.org/abs/1207.0580.
[24] Po-Sen Huang, Haim Avron, Tara N. Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In Proc. of the 2014 IEEE Intl. Conf. on Acou., Speech and Sig. Proc. (ICASSP), volume 1, page 6, 2014.
[25] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In Proc. of the 29th Intl. Conf. on Mach. Learn. (ICML), 2012.
[26] Brian Kingsbury, Jia Cui, Xiaodong Cui, Mark J. F. Gales, Kate Knill, Jonathan Mamou, Lidia Mangu, David Nolden, Michael Picheny, Bhuvana Ramabhadran, Ralf Schlüter, Abhinav Sethy, and Philip C. Woodland. A high-performance Cantonese keyword search system. In Proc. of the 2013 IEEE Intl. Conf. on Acou., Speech and Sig. Proc. (ICASSP), pages 8277–8281, 2013.
[27] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. lp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011.
[28] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Pereira et al.
[36], pages 1097–1105.
[30] Gert R. G. Lanckriet, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[31] Quoc Viet Le, Tamás Sarlós, and Alexander Johannes Smola. Fastfood: Approximating kernel expansions in loglinear time. In Dasgupta and McAllester [16].
[32] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[33] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Bottou et al. [8].
[34] Abdel-rahman Mohamed, George Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, 2012.
[35] R. Neal. Priors for infinite networks. Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 1994.
[36] F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors. Advances in Neural Information Processing Systems 25, 2012.
[37] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[38] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184, 2007.
[39] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21, pages 1313–1320, 2008.
[40] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932, 2012.
[41] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio.
Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
[42] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Pereira et al. [36], pages 2663–2671.
[43] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[44] Frank Seide, Gang Li, Xie Chen, and Dong Yu. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, pages 24–29, 2011.
[45] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep neural networks. In Proc. of Interspeech, pages 437–440, 2011.
[46] Alex Smola. Personal communication, 2014.
[47] Sören Sonnenburg and Vojtech Franc. COFFIN: A computational framework for linear SVMs. In Proc. of the 27th Intl. Conf. on Mach. Learn. (ICML), pages 999–1006, Haifa, Israel, 2010. URL http://www.icml2010.org/papers/280.pdf.
[48] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Cortes et al. [14].
[49] Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005. URL http://www.jmlr.org/papers/v6/tsang05a.html.
[50] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. on Pattern Anal. & Mach. Intell., 34(3):480–492, 2012.
[51] C. K. I. Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems 19, pages 599–621, 1996.

A Proof of Theorem 2

Theorem 2.
Suppose all $k_i(\cdot,\cdot)$ are translation-invariant kernels such that
$$k_i(\mathbf{x}-\mathbf{z}) = \int_{\mathbb{R}^d} p_i(\boldsymbol{\omega})\, e^{j\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{z})}\, d\boldsymbol{\omega}.$$
Then $k(\cdot,\cdot) = \prod_{i=1}^L k_i(\cdot,\cdot)$ is also translation-invariant, with
$$k(\mathbf{x}-\mathbf{z}) = \int_{\mathbb{R}^d} p(\boldsymbol{\omega})\, e^{j\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{z})}\, d\boldsymbol{\omega},$$
where the probability measure $p(\boldsymbol{\omega})$ is given by the convolution of the $p_i(\boldsymbol{\omega})$:
$$p(\boldsymbol{\omega}) = p_1(\boldsymbol{\omega}) * p_2(\boldsymbol{\omega}) * \cdots * p_L(\boldsymbol{\omega}).$$
Moreover, let $\boldsymbol{\omega}_i \sim p_i(\boldsymbol{\omega})$ be a random variable drawn from the corresponding distribution; then
$$\boldsymbol{\omega} = \sum_i \boldsymbol{\omega}_i \sim p(\boldsymbol{\omega}).$$
Namely, to approximate $k(\cdot,\cdot)$, one needs only to draw random variables from each individual component kernel's corresponding density, and use the sum of those variables to compute random features.

Proof. Denote $\boldsymbol{\Delta} = \mathbf{x} - \mathbf{z}$. For a translation-invariant kernel, we have
$$k_i(\mathbf{x}, \mathbf{z}) = k_i(\boldsymbol{\Delta}) = \int_{\boldsymbol{\omega}_i} p_i(\boldsymbol{\omega}_i)\, e^{j\boldsymbol{\omega}_i^\top \boldsymbol{\Delta}}\, d\boldsymbol{\omega}_i.$$
The product of the kernels,
$$k(\cdot,\cdot) = \prod_{i=1}^L k_i(\cdot,\cdot) = \prod_{i=1}^L k_i(\boldsymbol{\Delta}) = k(\boldsymbol{\Delta}),$$
is also translation-invariant. Substituting the integral representations and letting $\tilde{\boldsymbol{\omega}} = \sum_i^L \boldsymbol{\omega}_i$,
$$
\begin{aligned}
k(\boldsymbol{\Delta}) &= \prod_{i=1}^L \int_{\boldsymbol{\omega}_i} p_i(\boldsymbol{\omega}_i)\, e^{j\boldsymbol{\omega}_i^\top \boldsymbol{\Delta}}\, d\boldsymbol{\omega}_i
= \int_{\boldsymbol{\omega}_1 \ldots \boldsymbol{\omega}_L} p_1(\boldsymbol{\omega}_1) \cdots p_L(\boldsymbol{\omega}_L)\, e^{j\left(\sum_i^L \boldsymbol{\omega}_i^\top\right)\boldsymbol{\Delta}}\, d\boldsymbol{\omega}_1 \cdots d\boldsymbol{\omega}_L \\
&= \int_{\tilde{\boldsymbol{\omega}}} \left[ \int_{\boldsymbol{\omega}_1 \ldots \boldsymbol{\omega}_{L-1}} p_1(\boldsymbol{\omega}_1) \cdots p_{L-1}(\boldsymbol{\omega}_{L-1})\, p_L\!\left(\tilde{\boldsymbol{\omega}} - \sum_i^{L-1} \boldsymbol{\omega}_i\right) d\boldsymbol{\omega}_1 \cdots d\boldsymbol{\omega}_{L-1} \right] e^{j\tilde{\boldsymbol{\omega}}^\top \boldsymbol{\Delta}}\, d\tilde{\boldsymbol{\omega}} \\
&= \int p_{\tilde{\boldsymbol{\omega}}}(\tilde{\boldsymbol{\omega}})\, e^{j\tilde{\boldsymbol{\omega}}^\top \boldsymbol{\Delta}}\, d\tilde{\boldsymbol{\omega}}
= \int p_{\tilde{\boldsymbol{\omega}}}(\tilde{\boldsymbol{\omega}})\, e^{j\tilde{\boldsymbol{\omega}}^\top(\mathbf{x}-\mathbf{z})}\, d\tilde{\boldsymbol{\omega}}
= \mathbb{E}_{\tilde{\boldsymbol{\omega}}}\left[\phi_{\tilde{\boldsymbol{\omega}}}(\mathbf{x})\, \phi_{\tilde{\boldsymbol{\omega}}}(\mathbf{z})^*\right].
\end{aligned}
$$
We have used the fact (due to the convolution theorem) that
$$\int_{\boldsymbol{\omega}_1 \ldots \boldsymbol{\omega}_{L-1}} p_1(\boldsymbol{\omega}_1) \cdots p_{L-1}(\boldsymbol{\omega}_{L-1})\, p_L\!\left(\tilde{\boldsymbol{\omega}} - \sum_i^{L-1} \boldsymbol{\omega}_i\right) d\boldsymbol{\omega}_1 \cdots d\boldsymbol{\omega}_{L-1} = \left(p_1 * p_2 * \cdots * p_L\right)(\tilde{\boldsymbol{\omega}}) = p_{\tilde{\boldsymbol{\omega}}}(\tilde{\boldsymbol{\omega}}).$$
This means we have found the distribution $p_{\tilde{\boldsymbol{\omega}}}(\tilde{\boldsymbol{\omega}})$ generating the random projections for the new kernel $k(\cdot,\cdot) = \prod_{i=1}^L k_i(\cdot,\cdot)$. From the definition of $\tilde{\boldsymbol{\omega}}$, in order to sample from $p_{\tilde{\boldsymbol{\omega}}}(\tilde{\boldsymbol{\omega}})$, we can simply use the sum of independent samples from the $p_i(\boldsymbol{\omega}_i)$.
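Theorem 2 reduces sampling for a product kernel to summing draws from each component's spectral density. For a Gaussian RBF kernel times a Laplacian kernel, those densities are Gaussian and (a product of) Cauchy, respectively; a sketch under illustrative bandwidths:

```python
import numpy as np

def product_kernel_projections(d, D, s_gauss, s_lap, rng):
    """Random projections for the product kernel
    k(x, z) = exp(-||x - z||^2 / (2 s_gauss^2)) * exp(-||x - z||_1 / s_lap),
    per Theorem 2: a draw from the Gaussian kernel's spectral density,
    N(0, s_gauss^-2 I), plus a draw from the Laplacian kernel's, a
    product of Cauchy(0, 1/s_lap) distributions over coordinates."""
    w_gauss = rng.normal(scale=1.0 / s_gauss, size=(D, d))
    w_lap = rng.standard_cauchy(size=(D, d)) / s_lap   # Cauchy scale 1/s_lap
    return w_gauss + w_lap
```

In one dimension the construction is easy to check by Monte Carlo: the sample mean of cos(omega * t) over such draws converges to exp(-t^2 / (2 s_gauss^2)) * exp(-|t| / s_lap), the product kernel evaluated at t.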
B Detailed experimental results

B.1 Image recognition

We first provide details of our empirical studies on challenging problems in image recognition.

B.1.1 Handwritten digit recognition

Dataset and preprocessing   The dataset is described in section 5.2 of the main text. We scale the inputs to [0, 1) by dividing by 256.

Kernel   We use Gaussian RBF and Laplacian kernels with the kernel bandwidth selected from {1, 1.5, 2, 2.5, 3} × the median pairwise distance in the data. We select the learning rate from {5 × 10^-4, 10^-3, 5 × 10^-3, 10^-2, 5 × 10^-2, 10^-1}. We use a random feature dimension of 150,000; performance for different dimensions is shown in Table 5.

DNN   We trained DNNs with 1, 2, 3, or 4 hidden layers, with 1000, 2000, 2000, and 2000 hidden units, respectively. We first pre-trained 1 Gaussian-Bernoulli and 3 consecutive Bernoulli restricted Boltzmann machines (RBMs), all using stochastic gradient descent (SGD) with the Contrastive Divergence (CD-1) algorithm [20]. We select the learning rate from {10^-1, 1.5 × 10^-1} and the momentum from {0.5, 0.9}, and set L2 regularization to 2 × 10^-4 for 2 epochs of pretraining. In fine-tuning, we tune SGD with learning rate from {5 × 10^-3, 10^-2, 5 × 10^-2, 10^-1, 5 × 10^-1} and momentum from {0.7, 0.9}. We decrease the learning rate by a factor of 0.99 every epoch, set the mini-batch size to 100, and set L2 regularization to 0. We use early stopping to control overfitting. When training with data augmentation, we use a smaller learning rate and run for more epochs.

Data augmentation   We use mask-out noise with ratio {0.1, 0.2, 0.3} for both kernel methods and DNNs.

Results   Table 6 compares the performance of kernel methods to deep neural nets of different architectures. The best DNN result comes from a 4-hidden-layer network. Deep nets generally have slightly smaller test errors.
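The bandwidth grids above are expressed in multiples of the median pairwise distance. One common way to estimate that median, sketched below, uses a random subsample (the subsample size is our choice; the paper does not say how the median is computed on millions of points):

```python
import numpy as np

def median_pairwise_distance(X, n_sub=1000, rng=None):
    """Median heuristic: estimate the median Euclidean distance between
    distinct pairs of training points from a random subsample."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=min(n_sub, len(X)), replace=False)
    S = X[idx]
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)  # squared distances
    iu = np.triu_indices(len(S), k=1)                    # distinct pairs only
    return float(np.sqrt(np.median(d2[iu])))
```

The candidate bandwidths in the grid are then simply this estimate multiplied by each factor (e.g., 1, 1.5, 2, 2.5, 3).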
Kernel models benefit more from data augmentation and achieve similar error rates.

Table 5: Kernel methods on MNIST-6.7M (validation/test error rates, %)

Kernel type   Data aug.   10K         50K         100K        150K
Gaussian      No          1.45/1.42   1.03/1.25   0.98/1.12   0.97/1.09
Laplacian     No          1.93/1.93   1.21/1.34   1.16/1.17   1.10/1.13
Gaussian      Yes         -           0.83/1.03   0.79/0.92   0.77/0.85

Table 6: DNN on MNIST-6.7M (error rates, %)

              Original              Augmented
Model         Validation  Test      Validation  Test
kernel        0.97        1.09      0.77        0.85
4 hidden      0.71        0.69      0.64        0.80
3 hidden      0.78        0.73      0.74        0.77
2 hidden      0.76        0.71      0.64        0.79
1 hidden      0.84        0.95      0.79        0.76

PCA embedding   In addition to the t-SNE visualization, we also project each model's pre-softmax activations onto the first two principal components given by PCA. Fig. 2 contrasts the PCA embeddings for 1000 samples from the MNIST-6.7M test set. The neural network seems to give more spread-out embeddings than the kernel machine. The most noticeable classes are digit 3 (light blue in the lower left corner) and digit 6 (green on the right).

B.1.2 Object recognition

Dataset and preprocessing   The dataset is described in section 5.2 of the main text. We scale the inputs to [0, 1) by dividing by 256. No other preprocessing is used, because we would like to relate our results to previously reported DNN results on this data, where preprocessing is not applied.

Figure 2: PCA embeddings of the data by the kernel machine (left) and by the DNN (right).

Kernel   A Gaussian kernel is used. We achieve the best performance using 4,000,000 random features; this is done by training single models of 200K random features each in parallel and then combining them. Table 7 shows the performance of the kernel model with respect to different numbers of random features.
Similarly, we select the bandwidth from {0.5, 1, 1.5, 2, 3} × the median distance and the learning rate from {5 × 10^-4, 10^-3, 5 × 10^-3, 10^-2, 5 × 10^-2, 10^-1}.

DNN   We trained DNNs with 1 to 4 hidden layers, with 2000 hidden units per layer. In pretraining, we use a Gaussian RBM for the input layer and three Bernoulli RBMs for the intermediate hidden layers, using the CD-1 algorithm. (We adopt the parameterization of the GRBM in [10], which shows better performance.) For the Gaussian RBM, we tune the learning rate from {5 × 10^-5, 10^-4, 2 × 10^-4}, momentum from {0.2, 0.5, 0.9} (momentum can be increased to another of the three choices after 5 epochs), and L2 regularization from {2 × 10^-5, 2 × 10^-4}. For the Bernoulli RBMs, we tune the learning rate from {10^-2, 2.5 × 10^-2, 5 × 10^-2}, momentum from {0.2, 0.5, 0.9}, and fix L2 regularization at 2 × 10^-5. We used a constant learning rate throughout 30 epochs of pretraining, updating the momentum after 5 epochs. In fine-tuning, we used stochastic gradient descent with 0.9 momentum and a fixed schedule, tuning the initial learning rate from {4 × 10^-2, 8 × 10^-2, 1.2 × 10^-1} and decreasing it by a factor of 10 after 50 epochs; the mini-batch size is 50 and L2 regularization is 0. We use early stopping to control overfitting, and the optimal model is selected according to classification accuracy on the validation set. When training with data augmentation, we use a smaller learning rate and run for more epochs. Overfitting was observed when we increased the model from 3 hidden layers to 4.

Data augmentation   We apply additive Gaussian noise with standard deviation {0.1, 0.2, 0.3} to the raw pixels for both kernel methods and DNNs.

Results   Table 8 contrasts the results of kernel models and DNNs.
We observe overfitting for the 4-hidden-layer network; the best results are achieved with a 3-hidden-layer architecture, and deeper models start to overfit, giving worse validation and test performance. Kernel models achieve the best error rates when data augmentation is used.

Table 7: Kernel methods on CIFAR-10 (error rate, %)

               Original              Augmented
Gaussian r.f.  Validation  Test      Validation  Test
200K           43.74       44.48     42.15       43.13
1M             43.43       44.08     41.62       42.38
2M             43.26       44.04     41.47       42.26
4M             43.22       43.93     41.36       42.23

Table 8: DNN on CIFAR-10 (error rate, %)

              Original              Augmented
Model         Validation  Test      Validation  Test
kernel        43.22       43.93     41.36       42.23
4 hidden      43.21       43.74     43.00       43.38
3 hidden      42.89       43.29     42.93       43.35
2 hidden      43.30       43.76     43.80       44.81
1 hidden      48.40       48.94     47.28       47.79

B.2 Automatic speech recognition

In what follows, we provide comprehensive details of our empirical studies on challenging problems in automatic speech recognition.

B.2.1 Tasks, datasets, and evaluation metrics

Task   We have selected the task of acoustic modeling, a crucial component in automatic speech recognition. In its most basic form, acoustic modeling is analogous to conventional multi-class classification: the goal is to learn a predictive model that assigns context-dependent phoneme state labels to short segments of speech, called frames. While speech signals are highly non-stationary and context-sensitive, acoustic modeling addresses this by using acoustic features extracted from context windows (i.e., neighboring frames in temporal proximity) to capture the transient characteristics of the signals.

Data characteristics   To this end, we use two datasets: the IARPA Babel Program Cantonese (IARPA-babel101-v0.4c) and Bengali (IARPA-babel103b-v0.4b) limited language packs. Each pack contains a 20-hour training set and a 20-hour test set.
We designate about 10% of the training data as a held-out set for model selection and tuning (i.e., tuning hyperparameters, etc.). The training, held-out, and test sets contain different speakers. The acoustic data is very challenging: it consists of two-person conversations between people who know each other well (family and friends), recorded over telephone channels (in most cases mobile telephones) from speakers in a wide variety of acoustic environments, including moving vehicles and public places. As a result, it contains many natural phenomena such as mispronunciations, disfluencies, laughter, rapid speech, background noise, and channel variability. Compared to the more familiar TIMIT corpus, which contains about 4 hours of training data, the Babel data is substantially more challenging, because the TIMIT data is read speech recorded in a well-controlled, quiet studio environment.

As is standard in previous work using DNNs for speech recognition, the data is preprocessed using Gaussian mixture models to give alignments between phoneme state labels and 10-millisecond frames of speech [26]. The acoustic features are 360-dimensional real-valued dense vectors. There are 1000 (non-overlapping) context-dependent phoneme state labels for each language pack. For Cantonese, there are about 7.5 million data points for training, 0.9 million for held-out, and 7.2 million for test; for Bengali, 7.7 million for training, 1.0 million for held-out, and 7.1 million for test.

Evaluation metrics   We report 3 evaluation metrics typically found in mainstream speech recognition research.

Perplexity   Given a set of examples $\{(\mathbf{x}_i, y_i), i = 1 \ldots m\}$, the perplexity is defined as
$$\mathrm{perp} = \exp\left\{ -\frac{1}{m} \sum_{i=1}^{m} \log p(y_i \mid \mathbf{x}_i) \right\}.$$
The perplexity is lower-bounded by 1, attained when all predictions are perfect: $p(y_i \mid \mathbf{x}_i) = 1$ for all samples.
Under random guessing, $p(y_i \mid \mathbf{x}_i) = 1/C$, where $C$ is the number of classes, and the perplexity attains $C$. We use the perplexity on the held-out set for model selection and tuning, because perplexity is often found to be correlated with the next two performance measures.

Accuracy   The classification accuracy is defined as
$$\mathrm{acc} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\!\left[ y_i = \arg\max_{y \in \{1, 2, \ldots, C\}} p(y \mid \mathbf{x}_i) \right].$$

Token error rate (TER)   Speech recognition is inherently a sequence recognition problem. Thus, perp and acc provide only a proxy (and intermediate goals) for the sequence recognition error. To measure the latter, a full automatic speech recognition pipeline is necessary, in which the posterior probabilities of the phoneme labels $p(y \mid \mathbf{x})$ are combined with the probabilities of the language models (over the linguistic units of interest, such as words) to yield the most probable sequence of those units. A best alignment with the ground-truth sequence is computed, yielding token error rates. For Bengali, the token error rate is the word error rate (WER); for Cantonese, it is the character error rate (CER). Because it entails performing full speech recognition, obtaining TER is computationally costly, so it is rarely used for model selection and tuning.

Note also that the token error rates obtained on the Babel tasks are much higher than those reported for other conversational speech tasks such as Switchboard or Broadcast News, because we have much less training data for Babel than for those tasks. This low-resource setting is an important one in speech processing, given the large number of languages in the world for which speech and language models do not currently exist.

B.2.2 Deep neural net acoustic models

There are many variants of DNN techniques. We chose two flavors that learn from data in very different ways, in order to have a broader comparison.
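The two frame-level metrics defined in B.2.1 are straightforward to compute from a model's class posteriors; a minimal sketch (array layouts are our own convention):

```python
import numpy as np

def perplexity(probs, labels):
    """perp = exp{-(1/m) sum_i log p(y_i | x_i)}, for an (m, C) array of
    class posteriors and integer labels: 1 for perfect predictions,
    C for uniform guessing over C classes."""
    m = len(labels)
    return float(np.exp(-np.log(probs[np.arange(m), labels]).mean()))

def accuracy(probs, labels):
    """Fraction of frames whose arg-max class matches the label."""
    return float((probs.argmax(axis=1) == labels).mean())
```

For example, a model that always outputs the uniform distribution over 1000 phoneme states has perplexity exactly 1000, matching the bound discussed above.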
In either case, our model tuning is extensive.

IBM's DNN.  We have used IBM's proprietary system Attila for conventional speech recognition, adapted for the above-mentioned Babel task. A detailed description appears in [26]. Attila contains a state-of-the-art acoustic model provided by IBM. It also powers our full ASR pipeline for computing the token error rate (TER). We have also used it to convert raw speech signals into acoustic features. Concretely, the features at a frame form a 40-dimensional speaker-adapted representation that has previously been shown to work well with DNN acoustic models [26]. Features at 8 neighboring contextual frames are concatenated, yielding 360-dimensional features. We have used the same features for our kernel methods.

IBM's DNN acoustic model contains five hidden layers, each of which contains 1,024 units with logistic nonlinearities. The output is a softmax nonlinearity with 1,000 targets that correspond to quinphone context-dependent HMM states clustered using decision trees. All layers in the DNN are fully connected. The training of the DNN occurs in two stages. First, greedy layer-wise discriminative pretraining [45] sets the weights of each layer in a reasonable range. Then, the cross-entropy criterion is minimized with respect to all parameters in the network, using stochastic gradient descent with a mini-batch size of 250 samples, without momentum, and with the learning rate annealed based on the reduction in cross-entropy loss on a held-out set.

RBM-DNN.  We have designed another version of DNN, following the original Restricted Boltzmann Machine (RBM)-based training procedure for learning DNNs [22]. Specifically, the pre-training is unsupervised. We have trained DNNs with 1, 2, 3, and 4 hidden layers, and 500, 1000, and 2000 hidden units per layer (thus, 12 architectures per language in total).
The first hidden layer is a Gaussian RBM and the upper layers are binary Bernoulli RBMs. In pre-training, we use 5 epochs of SGD with the Contrastive Divergence (CD-1) algorithm on all training data. We tuned 3 hyperparameters: the learning rate, the momentum, and the strength of an $\ell_2$ regularizer. For fine-tuning, we used error back-propagation, tuning the initial learning rate, the learning rate decay, the momentum, and the strength of another $\ell_2$ regularizer. The fine-tuning usually converges in 10 epochs.

B.2.3 Kernel acoustic models

The development of kernel acoustic models does not require a combinatorial search over many factors. We experimented with only two types of kernels: Gaussian RBF and Laplacian. The only hyperparameter to tune is the kernel bandwidth, which ranges from 0.3 to 5 times the median of the pairwise distances in the data. (Typically, the median works well.)

Table 9: Best perplexity and accuracy by different models (see text for descriptions of the models); numbers are held-out/test

          Bengali                Cantonese
Model     perp      acc (%)      perp       acc (%)
ibm       3.4/3.5   71.5/71.2    6.8/6.16   56.8/58.5
rbm       3.3/3.4   72.1/71.6    6.2/5.7    58.3/59.3
1-k       3.7/3.8   70.1/69.7    6.8/6.2    57.0/58.3
a-2-k     3.6/3.8   70.3/70.0    6.7/6.0    57.1/58.5
m-2-k     3.7/3.8   70.3/69.9    6.7/6.1    57.1/58.4
c-2-k     3.5/3.6   71.0/70.4    6.5/5.7    57.3/58.8

Table 10: Performance of rbm acoustic models (held-out/test)

            Bengali                Cantonese
(h, L)      perp      acc (%)      perp      acc (%)
(1, 500)    3.9/3.9   69.2/69.3    7.1/6.4   55.8/57.4
(2, 500)    3.5/3.6   70.9/70.7    6.6/6.1   57.3/58.4
(3, 500)    3.5/3.5   71.2/70.9    6.4/5.9   57.7/58.6
(4, 500)    3.4/3.5   71.2/70.8    6.4/5.9   57.5/58.7
(1, 1000)   3.7/3.7   70.1/70.1    6.8/6.2   56.4/58.0
(2, 1000)   3.4/3.4   71.6/71.4    6.3/5.8   58.2/59.0
(3, 1000)   3.4/3.5   71.7/71.3    6.3/5.7   58.0/59.2
(4, 1000)   3.3/3.5   71.8/71.4    6.6/5.8   57.1/58.6
(1, 2000)   3.6/3.7   70.5/70.3    6.7/6.1   56.9/58.1
(2, 2000)   3.4/3.4   71.8/71.4    6.2/5.7   58.3/59.3
(3, 2000)   3.4/3.5   71.5/71.2    6.2/5.6   57.8/59.1
(4, 2000)   3.3/3.4   72.1/71.6    6.4/5.8   57.8/59.1

The
random feature dimensions we have used range from 2,000 to 400,000, though stable performance is often observed at 25,000 or above. For training with a very large number of features, we used the parallel training procedure described in section 4.1 of the main text.

All kernel acoustic models are multinomial logistic regressions, and are thus fit by convex optimization. As mentioned in section 4.1 of the main text, we use Stochastic Average Gradient (SAG), which efficiently leverages this convexity. We do tune the step size, selected from a loose range of 4 values $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.

For additive and multiplicative kernel combinations, we combine only two kernels, one Gaussian and the other Laplacian. For additive combinations, we first train two models, one for each kernel; the combining coefficient $\alpha$ is selected from $0.1, 0.2, \ldots, 0.9$. For composite kernels, we compose Gaussian with Laplacian. We perform a supervised dimensionality reduction, as described in section 4.2 of the main text, with the reduced dimensionality chosen from 50, 100, or 360. The first kernel's bandwidth is greedily selected to be optimal as a single-kernel acoustic model; the second kernel's bandwidth is selected after composing the features.

B.2.4 Results on perplexity and accuracy

Table 9 concisely contrasts the best perplexity and accuracy attained by the various systems: ibm (IBM's DNN), rbm (RBM-trained DNN), 1-k (single-kernel model), a-2-k (additive combination of two kernels), m-2-k (multiplicative combination of two kernels), and c-2-k (composite of two kernels). We report the metrics on both the held-out and test datasets (the numbers are separated by a /). In general, the metrics are consistent across both datasets, and perp correlates with acc reasonably well.
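All of these kernel models rest on the random-feature construction of [38]: for a shift-invariant kernel, the features $z(x) = \sqrt{2/D}\,\cos(W^\top x + b)$, with the columns of $W$ drawn from the kernel's spectral density (Gaussian for the Gaussian RBF kernel, Cauchy for the Laplacian kernel), satisfy $z(x)^\top z(y) \approx k(x, y)$. A minimal numpy sketch, where the dimensions and bandwidths are illustrative placeholders rather than our tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, D, kernel="gaussian", bandwidth=1.0):
    """Random Fourier features z(x) = sqrt(2/D) * cos(W^T x + b), with W
    sampled from the kernel's spectral density [38]: Gaussian for the
    Gaussian RBF kernel, Cauchy for the Laplacian kernel."""
    d = X.shape[1]
    if kernel == "gaussian":
        W = rng.normal(0.0, 1.0 / bandwidth, size=(d, D))
    elif kernel == "laplacian":
        W = rng.standard_cauchy(size=(d, D)) / bandwidth
    else:
        raise ValueError(kernel)
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Five 360-dimensional acoustic frames mapped to D = 2,000 features.
X = rng.normal(size=(5, 360))
Z = random_features(X, D=2000, kernel="laplacian", bandwidth=2.0)
print(Z.shape)  # (5, 2000)
```

A multinomial logistic regression trained on Z then approximates the corresponding kernel model, which is what makes the convex SAG training above applicable.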
On Bengali, across all systems, the rbm attains the best perplexity, outperforming ibm and suggesting that unsupervised pre-training is advantageous. The best performing kernel model is c-2-k, trailing slightly behind rbm and ibm. Similarly, on Cantonese, rbm performs the best, followed by c-2-k, both outperforming ibm.

As an illustrative example, we show in Table 10 the performance of rbm under different architectures ($h$ is the number of hidden layers and $L$ the number of hidden units per layer). Meanwhile, in Table 11, we show the performance of a single Laplacian kernel acoustic model with different numbers of random features.

Table 11: Performance of a single Laplacian kernel (held-out/test)

        Bengali                Cantonese
Dim     perp      acc (%)      perp      acc (%)
2k      4.4/4.4   66.5/66.8    8.5/7.4   52.7/54.8
5k      4.1/4.2   67.8/67.8    7.8/7.0   53.9/56.0
10k     4.0/4.1   68.4/68.3    7.5/6.7   54.9/56.6
25k     3.8/3.9   69.2/69.0    7.1/6.4   55.9/57.3
50k     3.8/3.9   69.7/69.4    6.9/6.2   56.5/57.9
100k    3.7/3.8   70.0/69.6    6.8/6.2   56.8/58.2
200k    3.7/3.8   70.1/69.7    6.8/6.2   57.0/58.3

Table 12: Best token error rates on the test set (%)

Model   Bengali   Cantonese
ibm     70.4      67.3
rbm     69.5      66.3
1-k     70.0      65.7
a-2-k   73.0      68.8
m-2-k   72.8      69.1
c-2-k   71.2      68.1

Contrasting Tables 10 and 11, it is interesting to observe that kernel models use far more parameters than DNNs to achieve similar perplexity and accuracy. For instance, an rbm with $(h = 1, L = 500)$ and a perplexity of 3.9 has $360 \times 500 + 500 \times 1000 = 0.68$ million parameters. This is a fraction of the comparable kernel model's $\mathrm{Dim} \times C = 10\mathrm{k} \times 1000 = 10$ million parameters. In some sense, this ratio provides an intuitive measure of the price of convenience, i.e., of using random features in kernel models instead of adapting features to the data as in DNNs.
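The parameter-count contrast above is straightforward to recompute (input dimension 360, 1,000 output classes; biases ignored, matching the arithmetic in the text); the helper names here are ours:

```python
def dnn_params(d_in, hidden, n_layers, n_classes):
    """Weight count of a fully connected DNN, ignoring biases."""
    sizes = [d_in] + [hidden] * n_layers + [n_classes]
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

def kernel_params(n_features, n_classes):
    """Weight count of multinomial logistic regression on random features."""
    return n_features * n_classes

print(dnn_params(360, 500, 1, 1000))  # 680000, i.e., 360*500 + 500*1000
print(kernel_params(10_000, 1000))    # 10000000
```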
B.2.5 Results on token error rates

Table 12 reports the performance of the various models measured in TER, another important metric, and one more directly relevant to speech recognition errors. Note that the RBM-trained DNN (rbm) performs the best on Bengali, but our best kernel model performs the best on Cantonese. Both perform better than IBM's DNN. On Cantonese, the improvement of our kernel model over ibm is noticeably large (a 1.6% absolute reduction).

Table 13 highlights several interesting comparisons between rbm and the kernel models. Concretely, it seems that DNNs need to be big enough to come close to their best TER, whereas the kernel models' performance plateaus rather quickly. This is the opposite of what we observed when comparing the two methods using perplexity and accuracy. One possible explanation is that the relationship between perplexity and TER differs across models. This is certainly plausible, given that TER is highly complex to compute and two different models might explore parameter spaces very differently. Another possible explanation is that the two models learn different representations, biased either toward perplexity or toward TER. Table 14 suggests that this might indeed be true: when we combine the two different models, we see handsome gains in performance over each individual one.

B.2.6 DNNs and kernels learn complementary representations

Inspired by the observations in the previous section, we set out to analyze in what way the representations learnt by the two different models might be complementary. We have obtained preliminary results. We took a learned DNN (the best performing one in terms of TER) and computed its pre-activations at the output layer, which are a linear transformation of the last hidden layer's outputs.
For the best performing single-kernel model, we computed the pre-activations similarly. Note that since both models predict the same set of labels, the pre-activations from either model have the same dimensionality. We perform PCA on them independently and then visualize them in 2D.

Table 13: Detailed comparison on TER for Bengali

Model   Arch.              TER (%)
rbm     h = 2, L = 1000    73.1
rbm     h = 3, L = 1000    72.7
rbm     h = 2, L = 2000    72.4
rbm     h = 3, L = 2000    72.2
rbm     h = 4, L = 1000    69.8
rbm     h = 4, L = 2000    69.5
1-k     Dim = 25k          73.1
1-k     Dim = 50k          70.2
1-k     Dim = 100k         70.0
1-k     Dim = 200k         70.0

Table 14: Token error rates (%) for combined models

Model                          Bengali   Cantonese
best single system             69.5      65.7
rbm (h = 3, L = 2000) + 1-k    69.7      65.3
rbm (h = 4, L = 1000) + 1-k    69.2      64.9
rbm (h = 4, L = 2000) + 1-k    69.1      64.9

Fig. 3 displays two scatter plots, each with 1000 points representing the per-class means of the learned representations. For easier visualization, we color each point not by its phoneme state label; instead, we collapse the states into phone labels (of which there are considerably fewer, generally around 40-60). An initial examination suggests that the kernel models' representations tend to form clumps for data from the same class. In the figure, the most obvious such observation is the blue-colored cluster. By contrast, under the representations learned by the DNNs, those blue-colored data points do not seem to form a large, tight cluster; they appear more spread out. The clumps seem to be indicative of the Gaussian kernels we have used. However, how important they are, and in what way the richer patterns of the DNNs' representations are more advantageous, require more careful and detailed analysis. We hope our work has provided enough incentive and tools for that pursuit.
Figure 3: Contrast of the learned representations by kernel models (left plot) and DNNs (right plot).
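The visualization pipeline behind Fig. 3 can be sketched as follows, assuming the 1,000-dimensional pre-activations (logits) have already been extracted from a model; the data below is synthetic and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def class_mean_pca_2d(preacts, labels, n_classes):
    """Mean pre-activation vector per class, projected to 2D by PCA
    (via an SVD of the centered class-mean matrix)."""
    means = np.stack([preacts[labels == c].mean(axis=0)
                      for c in range(n_classes)])
    centered = means - means.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T  # (n_classes, 2) scatter-plot coordinates

# Synthetic stand-in: 1,000-dim pre-activations for 10 classes.
preacts = rng.normal(size=(500, 1000))
labels = rng.integers(0, 10, size=500)
coords = class_mean_pca_2d(preacts, labels, 10)
print(coords.shape)  # (10, 2)
```

In the paper's setting, each model contributes 1,000 class means (one per context-dependent state), and the two models' projections are computed independently before plotting.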