Dictionary learning for fast classification based on soft-thresholding


Authors: Alhussein Fawzi*, Mike Davies†, Pascal Frossard*

October 3, 2014

Abstract

Classifiers based on sparse representations have recently been shown to provide excellent results in many visual recognition and classification tasks. However, the high cost of computing sparse representations at test time is a major obstacle that limits the applicability of these methods in large-scale problems, or in scenarios where computational power is restricted. We consider in this paper a simple yet efficient alternative to sparse coding for feature extraction. We study a classification scheme that applies the soft-thresholding nonlinear mapping in a dictionary, followed by a linear classifier. A novel supervised dictionary learning algorithm tailored for this low-complexity classification architecture is proposed. The dictionary learning problem, which jointly learns the dictionary and linear classifier, is cast as a difference-of-convex (DC) program and solved efficiently with an iterative DC solver. We conduct experiments on several datasets, and show that our learning algorithm, which leverages the structure of the classification problem, outperforms generic learning procedures. Our simple classifier based on soft-thresholding also competes with recent sparse coding classifiers when the dictionary is learned appropriately. The adopted classification scheme further requires less computational time at the testing stage than other classifiers. The proposed scheme shows the potential of the adequately trained soft-thresholding mapping for classification and paves the way towards the development of very efficient classification methods for vision problems.

1 Introduction

The recent decade has witnessed the emergence of huge volumes of high-dimensional information produced by all sorts of sensors.
For instance, a massive amount of high-resolution images is uploaded to the Internet every minute. In this context, one of the key challenges is to develop techniques to process these large amounts of data in a computationally efficient way. We focus in this paper on the image classification problem, which is one of the most challenging tasks in image analysis and computer vision. Given training examples from multiple classes, the goal is to find a rule that permits the prediction of the class of test samples. Linear classification is a computationally efficient way to categorize test samples. It consists in finding a linear separator between two classes. Linear classification has been the focus of much research in statistics and machine learning for decades, and the resulting algorithms are well understood. However, many datasets cannot be separated linearly and require complex nonlinear classifiers. A popular nonlinear scheme, which leverages the efficiency and simplicity of linear classifiers, embeds the data into a high-dimensional feature space, where a linear classifier is eventually sought. The feature space mapping is chosen to be nonlinear in order to convert nonlinear relations to linear relations. This nonlinear classification framework is at the heart of the popular kernel-based methods (Shawe-Taylor and Cristianini, 2004), which make use of a computational shortcut to bypass the explicit computation of feature vectors. Despite the popularity of kernel-based classification, its computational complexity at test time strongly depends on the number of training samples (Burges, 1998), which limits its applicability in large-scale settings. A more recent approach for nonlinear classification is based on sparse coding, which consists in finding a compact representation of the data in an overcomplete dictionary.
Sparse coding is known to be beneficial in signal processing tasks such as denoising (Elad and Aharon, 2006), inpainting (Fadili et al, 2009) and coding (Figueras i Ventura et al, 2006), but it has also recently emerged in the context of classification, where it is viewed as a nonlinear feature extraction mapping. It is usually followed by a linear classifier (Raina et al, 2007), but can also be used in conjunction with other classifiers (Wright et al, 2009). Classification architectures based on sparse coding have been shown to work very well in practice and even achieve state-of-the-art results on particular tasks (Mairal et al, 2012; Yang et al, 2009). The crucial drawback of sparse coding classifiers is, however, the prohibitive cost of computing the sparse representation of a signal or image sample at test time. This limits the relevance of such techniques in large-scale vision problems or when computational power is scarce.

* Ecole Polytechnique Federale de Lausanne (EPFL), Signal Processing Laboratory (LTS4), Lausanne 1015, Switzerland. Email: (alhussein.fawzi@epfl.ch, pascal.frossard@epfl.ch)
† IDCOM, The University of Edinburgh, Edinburgh, UK. Email: mike.davies@ed.ac.uk

Figure 1: Soft-thresholding classification scheme. The box in the middle applies the soft-thresholding nonlinearity h_α.

To remedy these large computational requirements, we adopt in the classification a computationally efficient sparsifying transform, the soft-thresholding mapping h_α, defined by:

h_α(z) = max(0, z − α) = (z − α)_+,   (1)

for α ∈ R_+ and (·)_+ = max(0, ·). Note that, unlike the usual definition of soft-thresholding given by sgn(z)(|z| − α)_+, we consider here the one-sided version of the soft-thresholding map, where the function is equal to zero for negative values (see Fig. 3 (a) vs. Fig. 3 (b)).
The map h_α is naturally extended to vectors z by applying the scalar map to each coordinate independently. Given a dictionary D, this map can be applied to a transformed signal z = D^T x that represents the coefficients of features in a signal x. Its outcome, which only retains the most important features of x, is used for classification. In more detail, we consider in this paper the following simple two-step procedure for classification:

1. Feature extraction: Let D = [d_1 | … | d_N] ∈ R^{n×N} and α ∈ R_+. Given a test point x ∈ R^n, compute h_α(D^T x).

2. Linear classification: Let w ∈ R^N. If w^T h_α(D^T x) is positive, assign x to class 1. Otherwise, assign it to class −1.

The architecture is illustrated in Fig. 1. The proposed classification scheme has the advantage of being simple, efficient and easy to implement, as it involves a single matrix-vector multiplication and a max operation. The soft-thresholding map has been successfully used in (Coates and Ng, 2011), as well as in a number of deep learning architectures (Kavukcuoglu et al, 2010b), which shows the relevance of this efficient feature extraction mapping. The remarkable results in Coates and Ng (2011) show that this simple encoder, when coupled with a standard learning algorithm, can often achieve results comparable to those of sparse coding, provided that the number of labeled samples and the dictionary size are large enough. However, when this is not the case, a proper training of the classifier parameters (D, w) becomes crucial for reaching good classification performance. This is the objective of this paper. We propose a novel supervised dictionary learning algorithm, which we call LAST (Learning Algorithm for Soft-Thresholding classifier). It jointly learns the dictionary D and the linear classifier w tailored for the classification architecture based on soft-thresholding.
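As an illustration, the two-step scheme above can be sketched in a few lines of NumPy (the function and variable names are ours, not from the paper):

```python
import numpy as np

def soft_threshold(z, alpha):
    """One-sided soft-thresholding h_alpha(z) = max(0, z - alpha) of Eq. (1),
    applied coordinate-wise; it returns zero for coordinates below alpha."""
    return np.maximum(0.0, z - alpha)

def classify(x, D, w, alpha=1.0):
    """Two-step scheme: feature extraction h_alpha(D^T x), then a linear
    classifier w; returns the predicted class, +1 or -1."""
    features = soft_threshold(D.T @ x, alpha)
    return 1 if w @ features > 0 else -1
```

At test time this costs a single matrix-vector product, one subtraction, one max and one dot product, which is the efficiency argument made above.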
We pose the learning problem as an optimization problem comprising a loss term that controls the classification accuracy and a regularizer that prevents overfitting. This problem is shown to be a difference-of-convex (DC) program, which is solved efficiently with an iterative DC solver. We then perform extensive experiments on texture, digit and natural image datasets, and show that the proposed classifier, coupled with our dictionary learning approach, exhibits remarkable performance with respect to numerous competitor methods. In particular, we show that our classifier provides comparable or better classification accuracy than sparse coding schemes.

The rest of this paper is organized as follows. In the next section, we highlight the related work. In Section 3, we formulate the dictionary learning problem for classifiers based on soft-thresholding. Section 4 then presents our novel learning algorithm, LAST, based on DC optimization. In Section 5, we perform extensive experiments on texture, natural image and digit datasets, and Section 6 finally gathers a number of important observations on the dictionary learning algorithm and the classification scheme.

2 Related work

We first highlight in this section the difference between the proposed approach and existing techniques from the sparse coding and dictionary learning literature. Then, we draw a connection between the considered approach and neural network models on the architecture and optimization aspects.

2.1 Sparse coding

The classification scheme adopted in this paper shares similarities with the now popular architectures that use sparse coding at the feature extraction stage. We recall that the sparse coding mapping, applied to a data point x in a dictionary D, consists in solving the optimization problem

argmin_{c ∈ R^N} ||x − Dc||_2^2 + λ ||c||_1.   (2)

It is now known that, when the parameters of the sparse coding classifier are trained in a discriminative way, excellent classification results are obtained in many vision tasks (Mairal et al, 2012, 2008; Ramirez et al, 2010). In particular, significant gains over the standard reconstructive dictionary learning approaches are obtained when the dictionary is optimized for classification. Several dictionary learning methods also consider an additional structure (e.g., low-rankness) on the dictionary, in order to incorporate task-specific prior knowledge (Zhang et al, 2013; Chen et al, 2012; Ma et al, 2012). This line of research is especially popular in face recognition applications, where a mixture-of-subspaces model is known to hold (Wright et al, 2009). To the best of our knowledge, all the discriminative dictionary learning methods optimize the dictionary with respect to the sparse coding map in Eq. (2), or a variant that still requires solving a nontrivial optimization problem. In our work, however, we introduce a discriminative dictionary learning method specific to the efficient soft-thresholding map. Interestingly, soft-thresholding can be viewed as a coarse approximation to non-negative sparse coding, as we show in Appendix A. This further motivates the use of soft-thresholding for feature extraction, as the merits of sparse coding for classification are now well established.

Closer to our work, several approaches have been introduced to approximate sparse coding with a more efficient feed-forward predictor (Kavukcuoglu et al, 2010a; Gregor and LeCun, 2010), whose parameters are learned in order to minimize the approximation error with respect to sparse codes. These works are, however, different from ours in several aspects. First, our approach does not require the result of the soft-thresholding mapping to be close to that of sparse coding.
We rather solely require a good classification accuracy on the training samples. Moreover, our dictionary learning approach is purely supervised, unlike Kavukcuoglu et al (2010a,b). Finally, these methods often use nonlinear maps (e.g., the hyperbolic tangent in Kavukcuoglu et al (2010a), multi-layer soft-thresholding in Gregor and LeCun (2010)) that are different from the one considered in this paper. The single soft-thresholding mapping considered here has the advantage of being simple, very efficient and easy to implement in practice. It is also strongly tied to sparse coding (see Appendix A).

2.2 Neural networks

The classification architecture considered in our work is also quite strongly related to artificial neural network models (Bishop, 1995). Neural network models are multi-layer architectures, where each layer consists of a set of neurons. The neurons compute a linear combination of the activation values of the preceding layer, and an activation function is then used to convert each neuron's weighted input to its activation value. Popular choices of activation functions are the logistic sigmoid and hyperbolic tangent nonlinearities. Our classification architecture can be seen as a neural network with one hidden layer, h_α as the hidden units' activation function, and zero bias (Fig. 2). Equivalently, the activation function can be set to max(0, x) with a constant bias −α across all hidden units. The dictionary D defines the connections between the input and hidden layer, while w represents the weights that connect the hidden layer to the output.

Figure 2: Neural network representation of our classification architecture. Greyed neurons have zero activation value.
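The equivalence with a one-hidden-layer rectifier network can be checked numerically: applying a ReLU to pre-activations carrying a shared bias of −α reproduces h_α exactly. A minimal sketch (all names are illustrative):

```python
import numpy as np

def soft_threshold(z, alpha):
    """The paper's one-sided soft-thresholding h_alpha."""
    return np.maximum(0.0, z - alpha)

def relu(z):
    """Rectifier activation max(0, x)."""
    return np.maximum(0.0, z)

def net_output(x, D, w, alpha):
    """One-hidden-layer network view: pre-activations D^T x with a bias of
    -alpha shared by all hidden units, ReLU activations, linear output w."""
    return w @ relu(D.T @ x - alpha)
```

Since relu(z − α) = h_α(z) coordinate-wise, `net_output` and the direct soft-thresholding scheme compute the same score.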
In an important recent contribution, Glorot et al (2011) showed that using the rectifier activation function max(0, x) results in better performance for deep networks than the more classical hyperbolic tangent function. On top of that, the rectifier nonlinearity is more biologically plausible and leads to sparse networks, a property that is highly desirable in representation learning (Bengio et al, 2013). While the architecture considered in this paper is close to that of Glorot et al (2011), it differs in several important aspects. First, our architecture assumes that hidden units have a bias equal to −α < 0, shared across all the hidden units, while it is unclear whether any constraint on the bias is set in the existing rectifier networks. The parameter α is intimately related to the sparsity of the features. This can be justified by the fact that h_α is an approximant to the non-negative sparse coding map with sparsity penalty α (see Appendix A). Without imposing any restriction on the neurons' bias (e.g., negativity) in rectifier networks, the representation might, however, not be sparse. This potentially explains the necessity of using an additional ℓ1 sparsifying regularizer on the activation values in Glorot et al (2011) to enforce the sparsity of the network, while sparsity is achieved implicitly in our scheme. Second, unlike the work of Glorot et al (2011), which employs a biological argument to introduce the rectifier function, we choose the soft-thresholding nonlinearity due to its strong relation to sparse coding. Our work therefore provides an independent motivation for considering the rectifier activation function, while the biological motivation in Glorot et al (2011) in turn gives us another motivation for considering soft-thresholding.
Third, rectified linear units are very often used in the context of deep networks (Maas et al, 2013; Zeiler et al, 2013), and seldom used with only one hidden layer. In that sense, the classification scheme considered in this paper has a simpler description, and can be seen as a particular instance of the general neural network models.

From an optimization perspective, our learning algorithm leverages the simplicity of our classification architecture and is very different from the generic techniques used to train neural networks. In particular, while neural networks are generally trained with stochastic gradient descent, we adopt an optimization based on the DC framework that directly exploits the structure of the learning problem.

3 Problem formulation

We present below the learning problem that jointly estimates the dictionary D ∈ R^{n×N} and the linear classifier w ∈ R^N in our fast classification scheme described in Section 1. We consider the binary classification task where X = [x_1 | … | x_m] ∈ R^{n×m} and y = [y_1 | … | y_m] ∈ {−1, 1}^m denote respectively the set of training points and their associated labels. We consider the following supervised learning formulation:

argmin_{D, w} Σ_{i=1}^m L(y_i w^T h_α(D^T x_i)) + (ν/2) ||w||_2^2,   (3)

where L denotes a convex loss function that penalizes incorrect classification of a training sample and ν is a regularization parameter that prevents overfitting. The soft-thresholding map h_α has been defined in Eq. (1). Typical loss functions that can be used in Eq. (3) are the hinge loss (L(x) = max(0, 1 − x)), which we adopt in this paper, or its smooth approximation, the logistic loss (L(x) = log(1 + e^{−x})). The above optimization problem attempts to find a dictionary D and a linear separator w such that w^T (D^T x_i − α)_+ has the same sign as y_i on the training set, which leads to correct classification.
At the same time, it keeps ||w||_2 small in order to prevent overfitting. Note that, to simplify the exposition, the bias term in the linear classifier is dropped. However, our study extends straightforwardly to include a nonzero bias. The problem formulation in Eq. (3) is reminiscent of the popular support vector machine (SVM) training procedure, where only a linear classifier w is learned. Instead, we embed the nonlinearity directly in the problem formulation, and learn jointly the dictionary D and the linear classifier w. This significantly broadens the applicability of the learned classifier to important nonlinear classification tasks. Note, however, that adding a nonlinear mapping raises an important optimization challenge, as the learning problem is no longer convex.

When we look closer at the optimization problem in Eq. (3), we note that, for any α > 0, the objective function is equal to:

Σ_{i=1}^m L(y_i α w^T h_1(D^T x_i / α)) + (ν/2) ||w||_2^2 = Σ_{i=1}^m L(y_i w̃^T h_1(D̃^T x_i)) + (ν′/2) ||w̃||_2^2,

where w̃ = α w, D̃ = D/α and ν′ = ν/α^2. Therefore, without loss of generality, we set the sparsity parameter α to 1 in the rest of this paper. This is in contrast with traditional dictionary learning approaches based on ℓ0 or ℓ1 minimization problems, where a sparsity parameter needs to be set manually beforehand. Fixing α = 1 and leaving the norms of the dictionary atoms unconstrained essentially permits adapting the sparsity to the problem at hand. This represents an important advantage, as setting the sparsity parameter is in general a difficult task. A sample x is then assigned to class '+1' if w^T h_1(D^T x) > 0, and class '−1' otherwise. Finally, we note that, even if our focus primarily goes to the binary classification problem, the extension to multi-class can easily be done through a one-vs-all strategy, for instance.

4 Learning algorithm

The problem in Eq.
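This scale invariance is easy to verify numerically: rescaling the pair (D, w) to (D/α, αw) while setting the threshold to 1 leaves the classification score unchanged, since α·max(0, t) = max(0, αt). A small sketch with arbitrary random data:

```python
import numpy as np

def h(z, alpha):
    """One-sided soft-thresholding of Eq. (1)."""
    return np.maximum(0.0, z - alpha)

def score(x, D, w, alpha):
    """Classification score w^T h_alpha(D^T x)."""
    return w @ h(D.T @ x, alpha)

rng = np.random.default_rng(0)
n, N = 5, 8
x = rng.standard_normal(n)
D = rng.standard_normal((n, N))
w = rng.standard_normal(N)
alpha = 0.3

# The rescaled pair (D/alpha, alpha*w) with threshold fixed to 1 gives
# exactly the same score as (D, w) with threshold alpha.
s1 = score(x, D, w, alpha)
s2 = score(x, D / alpha, alpha * w, 1.0)
```

This is why α carries no free modeling capacity once atom norms are unconstrained.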
(3) is non-convex and difficult to solve in general. In this section, we propose to relax the original optimization problem and cast it as a difference-of-convex (DC) program. Leveraging this property, we introduce LAST, an efficient algorithm for learning the dictionary and the classifier parameters in our classification scheme based on soft-thresholding.

4.1 Relaxed formulation

We now rewrite the learning problem in an appropriate form for optimization. We start with a simple but crucial change of variables. Specifically, we define u_j ← |w_j| d_j, v_j ← |w_j| and s_j ← sgn(w_j). Using this change of variables, we have, for any 1 ≤ i ≤ m,

y_i w^T h_1(D^T x_i) = y_i Σ_{j=1}^N sgn(w_j)(|w_j| d_j^T x_i − |w_j|)_+ = y_i Σ_{j=1}^N s_j (u_j^T x_i − v_j)_+.

Therefore, the problem in Eq. (3), with α = 1, can be rewritten in the following way:

argmin_{U, v, s} Σ_{i=1}^m L(y_i Σ_{j=1}^N s_j (u_j^T x_i − v_j)_+) + (ν/2) ||v||_2^2,   (4)
subject to v > 0.

The equivalence between the two problem formulations in Eqs. (3) and (4) only holds when the components of the linear classifier w are restricted to be all nonzero. This is, however, not a limiting assumption, as zero components in the normal vector of the optimal hyperplane of Eq. (3) can be removed, which is equivalent to using a dictionary of smaller size. The variable s, that is, the sign of the components of w, essentially encodes the "classes" of the different atoms. In other words, an atom d_j for which s_j = +1 (i.e., w_j is positive) is most likely to be active for samples of class '1'. Conversely, atoms with s_j = −1 are most likely active for class '−1' samples. We assume here that the vector s is known a priori. In other words, this means that we have prior knowledge on the proportion of class 1 and class −1 atoms in the desired dictionary.
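The identity underlying this change of variables can likewise be checked numerically; it relies on the positive homogeneity of (·)_+, which lets the factor |w_j| move inside the threshold. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 6
x = rng.standard_normal(n)
D = rng.standard_normal((n, N))
w = rng.standard_normal(N)

# Original parameterization: w^T h_1(D^T x)
lhs = w @ np.maximum(0.0, D.T @ x - 1.0)

# Change of variables: u_j = |w_j| d_j, v_j = |w_j|, s_j = sgn(w_j)
U = D * np.abs(w)          # scales column j of D by |w_j|
v = np.abs(w)
s = np.sign(w)
rhs = s @ np.maximum(0.0, U.T @ x - v)
```

Both parameterizations yield the same score, so optimizing over (U, v) with s fixed is equivalent to optimizing over (D, w) with fixed signs.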
For example, setting half of the entries of the vector s to be equal to +1 and the other half to −1 encodes the prior knowledge that we are searching for a dictionary with a balanced number of class-specific atoms. Note that s can be estimated from the distribution of the different classes in the training set, assuming that the proportion of class-specific atoms in the dictionary should approximately follow that of the training samples.

Figure 3: (a): sgn(x)(|x| − α)_+, (b): h_α (solid), and its smooth approximation q(x − α) (dashed), with β = 10. We used α = 1.

After the above change of variables, we now approximate the term (u_j^T x_i − v_j)_+ in Eq. (4) with a smooth function q(u_j^T x_i − v_j), where q(x) = (1/β) log(1 + exp(βx)), and β is a parameter that controls the accuracy of the approximation (Fig. 3 (b)). Specifically, as β increases, the quality of the approximation becomes better. The function q with β = 1 is often referred to as "softplus" and plays an important role in the training objective of many classification schemes, such as classification restricted Boltzmann machines (Larochelle et al, 2012). Note that this approximation is used only to make the optimization easier at the learning stage; at test time, the original soft-thresholding is applied for feature extraction. Finally, we replace the strict inequality v > 0 in Eq. (4) with v ≥ ε, where ε is a small positive constant. The latter constraint is easier to handle in the optimization, yet both constraints are essentially equivalent in practice. We end up with the following optimization problem:

(P):  argmin_{U, v} Σ_{i=1}^m L(y_i Σ_{j=1}^N s_j q(u_j^T x_i − v_j)) + (ν/2) ||v||_2^2,
subject to v ≥ ε,

which is a relaxed version of the learning problem in Eq. (4).
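The quality of the softplus surrogate q can be illustrated numerically: its gap to (·)_+ is largest at the origin, where it equals log(2)/β, so it shrinks as β grows. A short sketch:

```python
import numpy as np

def softplus(x, beta):
    """Smooth surrogate q(x) = (1/beta) * log(1 + exp(beta * x));
    np.logaddexp(0, t) computes log(1 + exp(t)) without overflow."""
    return np.logaddexp(0.0, beta * x) / beta

# Worst-case gap between q and the one-sided threshold on a grid
x = np.linspace(-2.0, 2.0, 401)
gap_10 = np.max(np.abs(softplus(x, 10.0) - np.maximum(0.0, x)))
gap_100 = np.max(np.abs(softplus(x, 100.0) - np.maximum(0.0, x)))
```

With β = 100, as used later in the experiments, the pointwise error is below 0.007, which explains why the relaxation barely changes the objective.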
Once the optimal variables (U, v) are determined, D and w can be obtained by inverting the above change of variables.

4.2 DC decomposition

The problem (P) is still a non-convex optimization problem that can be hard to solve using traditional methods, such as gradient descent or Newton-type methods. However, we show in this section that problem (P) can be written as a difference-of-convex (DC) program (Horst, 2000), which leads to efficient solutions. We first define DC functions. A real-valued function f defined on a convex set U ⊆ R^n is called DC on U if, for all x ∈ U, f can be expressed in the form

f(x) = g(x) − h(x),

where g and h are convex functions on U. A representation of the above form is said to be a DC decomposition of f. Note that DC decompositions are clearly not unique, as f(x) = (g(x) + c(x)) − (h(x) + c(x)) provides another decomposition of f for any convex function c. Optimization problems of the form min_x {f(x) : f_i(x) ≤ 0, i = 1, …, p}, where f and f_i for 1 ≤ i ≤ p are all DC functions, are called DC programs. The following proposition states that the problem (P) is DC:

Proposition 1. For any convex loss function L and any convex function q, the problem (P) is DC.

While Proposition 1 states that the problem (P) is DC, it does not give an explicit decomposition of the objective function, which is crucial for optimization. The following proposition exhibits a decomposition when L is the hinge loss.

Proposition 2. When L(x) = max(0, 1 − x), the objective function of problem (P) is equal to g − h, where

g = (ν/2) ||v||_2^2 + Σ_{i=1}^m max( Σ_{j: s_j = y_i} q(u_j^T x_i − v_j), 1 + Σ_{j: s_j ≠ y_i} q(u_j^T x_i − v_j) ),
h = Σ_{i=1}^m Σ_{j: s_j = y_i} q(u_j^T x_i − v_j).

The proofs of Propositions 1 and 2 are given in Appendix B. Due to Proposition 2, the problem (P) can be solved efficiently using a DC solver.
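The decomposition of Proposition 2 can be sanity-checked on a single sample: writing A and B for the sums of softplus terms over atoms whose sign agrees, respectively disagrees, with the label, the hinge loss equals max(A, 1 + B) − A. A sketch of the per-sample terms (omitting the (ν/2)||v||² regularizer, which sits in g):

```python
import numpy as np

def softplus(x, beta=100.0):
    """q(x) = (1/beta) log(1 + exp(beta x)), computed stably."""
    return np.logaddexp(0.0, beta * x) / beta

rng = np.random.default_rng(2)
n, N = 4, 6
x, y = rng.standard_normal(n), 1
U = rng.standard_normal((n, N))
v = rng.uniform(0.5, 1.5, N)
s = np.where(rng.random(N) < 0.5, 1, -1)

q = softplus(U.T @ x - v)
same = q[s == y].sum()            # atoms whose sign matches the label
diff = q[s != y].sum()            # atoms whose sign disagrees

hinge = max(0.0, 1.0 - y * (s @ q))   # per-sample loss term of (P)
g_i = max(same, 1.0 + diff)           # per-sample term of g (Prop. 2)
h_i = same                            # per-sample term of h (Prop. 2)
```

Both `g_i` and `h_i` are convex in (U, v) since q is convex, a sum of convex functions is convex, and max preserves convexity; their difference recovers the hinge term exactly.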
4.3 Optimization

DC problems are well-studied optimization problems, and efficient algorithms with good performance in practice have been proposed (Horst, 2000; Tao and An, 1998; see An and Tao (2005) and references therein, and Sriperumbudur et al (2007)). While there exists a number of popular approaches that solve DC programs globally (e.g., cutting plane and branch-and-bound algorithms (Horst, 2000)), these techniques are often inefficient and limited to very small scale problems. A robust and efficient difference-of-convex algorithm (DCA), suited for solving general large-scale DC programs, is proposed in Tao and An (1998). DCA is an iterative algorithm that consists in solving, at each iteration, the convex optimization problem obtained by linearizing h (i.e., the non-convex part of f = g − h) around the current solution. The local convergence of DCA is proven in Theorem 3.7 of Tao and An (1998), and we refer to this paper for further theoretical guarantees on the stability and robustness of the algorithm. Although DCA is only guaranteed to reach a local minimum, the authors of Tao and An (1998) state that DCA often converges to a global optimum. When this is not the case, multiple restarts might be used to improve the solution. We note that DCA is very close to the concave-convex procedure (CCCP) introduced in (Yuille et al, 2002).

At iteration k of DCA, the linearized optimization problem is given by:

argmin_{(U, v)} { g(U, v) − Tr(U^T A) − v^T b }  subject to v ≥ ε,   (5)

where (A, b) = ∇h(U^k, v^k), (U^k, v^k) are the solution estimates at iteration k, and the functions g and h are defined in Proposition 2. Note that, due to the convexity of g, the problem in Eq. (5) is convex and can be solved using any convex optimization algorithm (Boyd and Vandenberghe, 2004).
The method we propose to use here is a projected first-order stochastic subgradient descent algorithm. Stochastic gradient descent is an efficient optimization algorithm that can handle large training sets (Akata et al, 2014). To make the exposition clearer, we first define the function:

p(U, v; x_i, y_i) = max( Σ_{j: s_j = y_i} q(u_j^T x_i − v_j), 1 + Σ_{j: s_j ≠ y_i} q(u_j^T x_i − v_j) ) + (1/m) ( (ν/2) ||v||_2^2 − Tr(U^T A) − v^T b ).

The objective function of Eq. (5) that we wish to minimize can then be written as Σ_{i=1}^m p(U, v; x_i, y_i). We solve this optimization problem with the projected stochastic subgradient descent algorithm in Algorithm 1.

Algorithm 1 Optimization algorithm to solve the linearized problem in Eq. (5)
1. Initialization: U ← U^k and v ← v^k.
2. For t = 1, …, T:
  2.1 Let (x, y) be a randomly chosen training point and its associated label.
  2.2 Choose the stepsize ρ_t ← min(ρ, ρ t_0 / t).
  2.3 Update U and v by a projected subgradient step:
      U ← U − ρ_t ∂_U p(U, v; x, y),
      v ← Π_{v ≥ ε}(v − ρ_t ∂_v p(U, v; x, y)),
      where Π_{v ≥ ε} is the projection operator onto the set {v ≥ ε}.
3. Return U^{k+1} ← U and v^{k+1} ← v.

In more detail, at each iteration of Algorithm 1, a training sample (x, y) is drawn. U and v are then updated by performing a step in the direction of a subgradient ∂p(U, v; x, y). Many different stepsize rules can be used with stochastic gradient descent methods. In this paper, similarly to the strategy employed in Mairal et al (2012), we have chosen a stepsize that remains constant for the first t_0 iterations, and then takes the value ρ t_0 / t.[1] Moreover, to accelerate the convergence of the stochastic gradient descent algorithm, we consider a small variation of Algorithm 1, where a minibatch containing several training samples along with their labels is drawn at each iteration, instead of a single sample.
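A compact NumPy rendering of Algorithm 1 might look as follows. This is an illustrative sketch, not the authors' reference implementation: the subgradient of the max term only involves the active branch, the softplus derivative is a sigmoid, and all hyperparameter defaults are placeholders:

```python
import numpy as np

def sigmoid(t):
    """Derivative of softplus: q'(z) = sigmoid(beta * z)."""
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60, 60)))

def algorithm1(X, y, s, U0, v0, A, b, beta=100.0, nu=1.0, eps=1e-6,
               rho=0.01, T=500, seed=0):
    """Projected stochastic subgradient descent on the linearized problem
    of Eq. (5). (A, b) = grad h at the current DCA iterate; the stepsize
    follows rho_t = min(rho, rho * t0 / t) with t0 = T/10, and v is
    projected back onto {v >= eps} after each step."""
    rng = np.random.default_rng(seed)
    U, v = U0.copy(), v0.copy()
    m = X.shape[1]
    t0 = T / 10.0
    for t in range(1, T + 1):
        i = rng.integers(m)                      # draw one training sample
        x, yi = X[:, i], y[i]
        z = U.T @ x - v
        q = np.logaddexp(0.0, beta * z) / beta   # smoothed (.)_+
        same, diff = s == yi, s != yi
        # subgradient of max(., .): only the active branch contributes
        active = same if q[same].sum() >= 1.0 + q[diff].sum() else diff
        dq = sigmoid(beta * z) * active          # q'(z_j) on active atoms
        gU = np.outer(x, dq) - A / m             # d p / d U
        gv = -dq + (nu * v - b) / m              # d p / d v
        rho_t = min(rho, rho * t0 / t)
        U = U - rho_t * gU
        v = np.maximum(v - rho_t * gv, eps)      # projection onto v >= eps
    return U, v
```

The projection step is trivial here because the feasible set {v ≥ ε} is a box, which is one reason this constraint is preferred over the strict inequality v > 0.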
This is a classical heuristic in stochastic gradient descent algorithms. Note that, when the size of the minibatch is equal to the number of training samples, this algorithm reduces to traditional batch gradient descent.

Finally, our complete LAST learning algorithm based on DCA is formally given in Algorithm 2. Starting from a feasible point U^0 and v^0, LAST iteratively solves the constrained convex problem given in Eq. (5) with the procedure of Algorithm 1. Recall that this problem corresponds to the original DC program (P), except that the function h has been replaced by its linear approximation around the current solution (U^k, v^k) at iteration k. Many criteria can be used to terminate the algorithm. We choose here to terminate when a maximum number of iterations K has been reached, and to terminate the algorithm earlier when the following condition is satisfied:

min( |(ω^{k+1} − ω^k)_{i,j}|, |(ω^{k+1} − ω^k)_{i,j} / (ω^k)_{i,j}| ) ≤ δ,

where the matrix Ω^k = ((ω^k)_{i,j}) is the row concatenation of U and v^T, and δ is a small positive number. This condition detects the convergence of the learning algorithm, and is verified whenever the change in U and v is very small. This termination criterion is used, for example, in Sriperumbudur et al (2007).

5 Experimental results

In this section, we evaluate the performance of our classification algorithm on texture, digit and natural image datasets, and compare it to different competitor schemes. We expose in Section 5.1 the choice of the parameters of the model and the algorithm. We then focus on the experimental assessment of our scheme.
Following the methodology of Coates and Ng (2011), we break the feature extraction algorithms into (i) a learning algorithm (e.g., K-means) where a set of basis functions (or dictionary) is learned, and (ii) an encoding function (e.g., ℓ1 sparse coding) that maps an input point to its feature vector. In the first step of our analysis (Section 5.2), we therefore fix the encoder to be the soft-thresholding mapping and compare LAST to existing supervised and unsupervised learning techniques. Then, in the following subsections, we compare our complete classification architecture (i.e., learning and encoding function) to several classifiers, in terms of accuracy and efficiency. In particular, we show that our proposed approach is able to compete with recent classifiers, despite its simplicity.

[1] The precise choice of the parameters ρ and t_0 is discussed in Section 5.1.

Algorithm 2 LAST (Learning Algorithm for Soft-Thresholding classifier)
1. Choose any initial point: U^0 and v^0 ≥ ε.
2. For k = 0, …, K − 1:
  2.1 Compute (A, b) = ∇h(U^k, v^k).
  2.2 Solve with Algorithm 1 the convex optimization problem:
      (U^{k+1}, v^{k+1}) ← argmin_{(U, v)} { g(U, v) − Tr(U^T A) − v^T b }  subject to v ≥ ε.
  2.3 If (U^{k+1}, v^{k+1}) ≈ (U^k, v^k), return (U^{k+1}, v^{k+1}).

5.1 Parameter selection

We first discuss the choice of the model parameters for our method. Unless stated otherwise, we choose the vector s according to the distribution of the different classes in the training set. We set the value of the regularization parameter to ν = 1, as it was found empirically to be a good choice in our experiments. It is worth mentioning that setting ν by cross-validation might give better results, but it would also be computationally more expensive. We moreover set the parameter of the soft-thresholding mapping approximation to β = 100.
Recall finally that the sparsity parameter α is always equal to 1 in our method, and therefore does not require any manual setting or cross-validation procedure. In all experiments, we have moreover chosen to initialize LAST by setting U^0 equal to a random subsample of the training set, and v^0 to the vector whose entries are all equal to 1. We however noticed empirically that choosing a different initialization strategy does not significantly change the testing accuracy. Then, we fix the maximum number of iterations of LAST to K = 50. Moreover, setting properly the parameters t_0 and ρ in Algorithm 1 is quite crucial in controlling the convergence of the algorithm. In all the experiments, we have set the parameter t_0 = T/10, where T denotes the number of iterations. Furthermore, during the first T/20 iterations, several values of ρ are tested ({0.1, 0.01, 0.001}), and the value that leads to the smallest objective function is chosen for the rest of the iterations. Finally, the minibatch size in Algorithm 1 depends on the size of the training data. In particular, when the size of the training data m is relatively small (i.e., smaller than 5000), we used batch gradient descent, as the computation of the (complete) gradient is tractable. In this case, we set the number of iterations to T = 1000. Otherwise, we use a batch size of 200, and perform T = 5000 iterations of the stochastic gradient descent in Algorithm 1.

5.2 Analysis of the learning algorithm

In a first set of experiments, we focus on the comparison of our learning algorithm (LAST) to other learning techniques, and fix the encoder to be the soft-thresholding mapping for all the methods. We present a comparative study on textures and natural images classification tasks.

5.2.1 Experimental settings

We consider the following dictionary learning algorithms:

1.
Supervised random samples: The atoms of D are chosen randomly from the training set, in a supervised manner. That is, if κ denotes the desired proportion of class '1' atoms in the dictionary, the dictionary is built by randomly picking κN training samples from class '1' and (1 − κ)N samples from class '−1', where N is the number of atoms in the dictionary.

2. Supervised K-means: We build the dictionary by merging the subdictionaries obtained by applying the K-means algorithm successively to the training samples of class '1' and '−1', where the number of clusters is fixed respectively to κN and (1 − κ)N.

3. Dictionary learning for ℓ1 sparse coding: The dictionary D is built by solving the classical dictionary learning problem for ℓ1 sparse coding:

min_{D, c_i} Σ_{i=1}^m ||x_i − D c_i||_2^2 + λ ||c_i||_1  subject to ∀j, ||d_j||_2^2 ≤ 1.  (6)

To solve this optimization problem, we used the algorithm proposed by Mairal et al (2010) and implemented in the SPAMS package. The parameter λ is chosen by a cross-validation procedure in the set {0.1, 0.01, 0.001}. Note that, while the previous two learning algorithms make use of the labels, this algorithm is unsupervised.

4. Stochastic Gradient Descent (SGD): The dictionary D and classifier w are obtained by optimizing the following objective function using mini-batch stochastic gradient descent:

J(D, w) = Σ_{i=1}^m L(y_i w^T q(D^T x_i − α)) + (ν/2) ||w||_2^2,  with q(x) = (1/β) log(1 + exp(βx)).

This corresponds to the original objective function in Eq. (3), where h_α is replaced with its smooth approximant.² This smoothing procedure is similar to the one used in our relaxed formulation (Section 4.1). As in LAST, we set β = 100, α = 1, and use the same initialization strategy. This setting allows us to directly compare LAST and this generic stochastic gradient descent procedure widely used for training neural networks.
Following Glorot et al (2011), we use a mini-batch size of 10, and a constant step size chosen in {0.1, 0.01, 0.001, 0.0001}. The stepsize is chosen through a cross-validation procedure, with a randomly chosen validation set made up of 10% of the training data. The number of iterations of SGD is set to 250000.

For the first three algorithms, the parameter α in the soft-thresholding mapping is chosen with cross-validation in {0.1, 0.2, ..., 0.9, 1}. The features are then computed by applying the soft-thresholding map h_α, and a linear SVM classifier is trained in the feature space. For the random samples and K-means approaches, we set κ = 0.5, as we consider classification tasks with roughly equal numbers of training samples from each class. Finally, for SGD and LAST, the dictionary D and linear classifier w are learned simultaneously. The encoder h_1 is used to compute the features.

5.2.2 Experimental results

In our first experiment, we consider two binary texture classification tasks, where the textures are collected from the 32 Brodatz dataset (Valkealahti and Oja, 1998) and shown in Fig. 4. For each pair of textures under test, we build the training set by randomly selecting 500 patches of size 12 × 12 per texture, and the test data is constructed similarly by taking 500 patches per texture. The test data does not contain any of the training patches. All the patches are moreover normalized to have unit ℓ2 norm. Fig. 5 shows the binary classification accuracy of the soft-thresholding based classifier as a function of the dictionary size, for dictionaries learned with the different algorithms.

Figure 4: Two binary classification tasks (bark vs. woodgrain and pigskin vs. pressedcl).

For the first task (bark vs.
woodgrain), one can see that the LAST and SGD dictionary learning methods outperform the other methods for small dictionary sizes. For large dictionaries (i.e., N ≈ 400), however, all the learning algorithms yield approximately the same classification accuracy. This result is in agreement with the conclusions of Coates and Ng (2011), where the authors show empirically that the choice of the learning algorithm becomes less crucial when dictionaries are very large. In the second and more difficult classification task (pigskin vs. pressedcl), our algorithm yields the best classification accuracy for all tested dictionary sizes (10 ≤ N ≤ 400). Interestingly, unlike in the previous task, the design of the dictionary is crucial for all tested dictionary sizes. Using much larger dictionaries might result in performance that is close to the one obtained with our algorithm, but comes at the price of additional computational and memory costs.

²We also tested SGD on the original (non-smooth) optimization problem. This resulted in slightly worse performance. We therefore only report results obtained on the smoothed objective function.

Figure 5: Texture classification results (fixed soft-thresholding encoder): classification accuracy vs. dictionary size for supervised random samples, supervised K-means, DL for ℓ1 sparse coding, SGD and LAST. (a) Bark vs. Woodgrain; (b) Pigskin vs. Pressedcl.

Fig. 6 further illustrates the evolution of the objective function with respect to the elapsed training time for LAST and SGD, for a dictionary of size 50. One can see that LAST quickly converges to a solution with a small objective function.
On the other hand, SGD reaches a solution with a larger objective function than LAST.

Figure 6: J(D, w) as a function of the elapsed time [s] for stochastic gradient descent and LAST. For SGD, J(D_{t=100}, w_{t=100}) = 19; for LAST, J(D_{t=100}, w_{t=100}) = 1.4.

We now conduct experiments on the popular CIFAR-10 image database (Krizhevsky and Hinton, 2009). The dataset contains 10 classes of 32 × 32 RGB images. For simplicity and better comparison of the different learning algorithms, we restrict in a first stage the dataset to the two classes "deer" and "horse". We extend our results to the multi-class scenario later in Section 5.5. Fig. 7 illustrates some training examples from the two classes. The classification results are reported in Fig. 8.

Figure 7: Examples of CIFAR-10 images in the categories "deer" and "horse".

Once again, the soft-thresholding based classifier with a dictionary and linear classifier learned with LAST outperforms all other learning techniques. In particular, using the LAST dictionary learning strategy results in significantly higher performance than stochastic gradient descent for all dictionary sizes. We further note that with a very small dictionary (i.e., N = 2), LAST reaches an accuracy of 77%, whereas some learning algorithms (e.g., K-means) do not reach this accuracy even with a dictionary that contains as many as 400 atoms.

Figure 8: Performance on the "deer" vs. "horse" binary classification task (fixed soft-thresholding encoder): classification accuracy vs. dictionary size for the five learning methods.

To further illustrate this point, we show in Fig. 9 the 2-D testing features obtained with a dictionary of two atoms, when D is learned respectively with the K-means method and LAST.
Despite the very low dimensionality of the feature vectors, the two classes can be separated with reasonable accuracy using our algorithm (Fig. 9 (b)), whereas the features obtained with the K-means algorithm clearly cannot be discriminated (Fig. 9 (a)). We finally illustrate in Fig. 10 the dictionaries learned using K-means and LAST for N = 30 atoms. It can be observed that, while the K-means dictionary consists of smoothed images that minimize the reconstruction error, our algorithm learns a discriminative dictionary whose goal is to underline the differences between the images of the two classes. In summary, our supervised learning algorithm, specifically tailored for the soft-thresholding encoder, provides significant improvements over traditional dictionary learning schemes. Our classifier can reach high accuracy rates, even with very small dictionaries, which is not possible with other learning schemes.

Figure 9: Learned 2D features (max(0, d_1^T x − 1), max(0, d_2^T x − 1)) and linear classifiers with K-means and LAST for the "deer" vs. "horse" classification task (N = 2). (a) K-means; (b) LAST.

Figure 10: Normalized dictionary atoms learned with K-means and LAST, for the "deer" vs. "horse" binary classification task (N = 30). (a) K-means; (b) LAST.

5.3 Classification performance on binary datasets

In this section, we compare the proposed LAST classification method³ to other classifiers. Before going through the experimental results, we first present the different methods under comparison:

1. Linear SVM: We use the efficient Liblinear (Fan et al, 2008) implementation for training the linear classifier. The regularization parameter is chosen using a cross-validation procedure.
³By extension, we define the LAST classifier to be the soft-thresholding based classifier, where the parameters (D, w) are learned with LAST.

2. RBF kernel SVM: We use LibSVM (Chang and Lin, 2011) for training. Similarly, the regularization and width parameters are set with cross-validation.

3. Sparse coding: Similarly to the previous section, we train the dictionary by solving Eq. (6). We use however the encoder that "matches naturally" with this training algorithm, that is:

argmin_c ||x − Dc||_2^2 + λ ||c||_1,

where x is the test sample, D the previously learned dictionary and c the resulting feature vector. A linear SVM is then trained on the resulting feature vectors. This classification architecture, denoted "sparse coding" below, is similar to that of Raina et al (2007).

4. Nearest neighbor classifier (NN): Our last comparative scheme is a nearest neighbor classifier, where the dictionary is learned using the supervised K-means procedure described in Section 5.2.1. At test time, the sample is assigned the label of the dictionary atom (i.e., cluster) that is closest to it.

Note that we have dropped the supervised random samples learning algorithm used in the previous section, as it was shown to have worse classification accuracy than the K-means approach.

                          Task 1 [%]   Task 2 [%]
Linear SVM                49.5         49.1
RBF kernel SVM            98.5         90.1
Sparse coding (N = 50)    97.5         85.5
Sparse coding (N = 400)   98.1         90.9
NN (N = 50)               94.3         84.1
NN (N = 400)              97.8         86.6
LAST (N = 50)             98.7         87.3
LAST (N = 400)            98.6         93.5

Table 1: Classification accuracy for the binary texture classification tasks.

Table 1 first shows the accuracies of the different classifiers in the two binary texture classification tasks described in Section 5.2.2. In both experiments, the linear SVM classifier results in a very poor performance, which is close to that of a random classifier. This suggests that the considered task is nonlinear, and has to be tackled with a nonlinear classifier.
One can see that the RBF kernel SVM results in a significant increase in the classification accuracy. Similarly, the ℓ1 sparse coding nonlinear mapping also results in much better performance compared to the linear classifier, while the nearest neighbor approach performs a bit worse than sparse coding. We note that, for a fixed dictionary size, our classifier outperforms the NN and sparse coding classifiers in both tasks. Moreover, it provides comparable or superior performance to the RBF kernel SVM in both tasks.

We now turn to the binary experiment "deer" vs. "horse" described in the previous subsection. We show the classification accuracies of the different classifiers in Table 2. LAST outperforms the sparse coding and nearest neighbor classifiers for the tested dictionary sizes. The RBF kernel SVM however slightly outperforms LAST with N = 100 in this experiment. Note however that the RBF kernel SVM approach is much slower at test time, which makes it impractical for large-scale problems.

                          "deer" vs. "horse" [%]
Linear SVM                72.6
RBF kernel SVM            83.5
Sparse coding (N = 10)    70.6
Sparse coding (N = 100)   76.2
NN (N = 10)               67.7
NN (N = 100)              70.9
LAST (N = 10)             80.1
LAST (N = 100)            82.8

Table 2: Classification accuracy on the binary classification problem "deer" vs. "horse".

Overall, the proposed LAST classifier compares favorably to the different tested classifiers. In particular, LAST outperforms the sparse coding technique for a fixed dictionary size in our experiments. This result is notable, as sparse coding classifiers are known to provide very good classification performance in vision tasks. Note that, when used with another standard learning approach such as K-means, the soft-thresholding based classifier is outperformed by sparse coding, which shows the importance of the learning scheme in the success of this classifier.
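The ℓ1 sparse-coding encoder used by the comparison scheme above can be sketched with a plain ISTA loop, a basic proximal-gradient solver. This is an illustrative sketch only: the paper's experiments rely on the SPAMS package, which uses faster solvers, and the function name is ours.

```python
import numpy as np

def ista_encoder(D, x, lam, n_iter=200):
    """l1 sparse-coding encoder: argmin_c ||x - D c||_2^2 + lam * ||c||_1,
    solved with plain ISTA (gradient step followed by soft-thresholding)."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ c - x)
        z = c - grad / L
        # Soft-thresholding proximal step for the l1 penalty.
        c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return c
```

For an orthonormal D this recovers the closed-form solution c_i = sign((D^T x)_i) max(|(D^T x)_i| − λ/2, 0), which makes the contrast with the single soft-thresholding map of our scheme explicit.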
                                        MNIST   USPS
Linear SVM                              8.19    9.07
RBF kernel SVM                          1.4     4.2
K-NN ℓ2                                 5.0     5.2
LAST                                    1.32    4.53
Sparse coding                           3.0     5.33
Huang and Aviyente (2006)               -       6.05
SDL-G L (Mairal et al, 2008)            3.56    6.67
SDL-D L (Mairal et al, 2008)            1.05    3.54
Ramirez et al (2010)                    1.26    3.98
SGD                                     2.22    5.88
3-layer ReLU net (Glorot et al, 2011)   1.43    -

Table 3: Classification error (percentage) on the MNIST and USPS datasets.

5.4 Handwritten digits classification

We now consider a classification task on the MNIST (LeCun et al, 1998) and USPS (Hull, 1994) handwritten digits datasets. USPS contains 9298 images of size 16 × 16 pixels, with 7291 images used for training and 2007 for testing. The larger MNIST database is composed of 60000 training images and 10000 test images, all of size 28 × 28 pixels. We preprocess all the images to have zero mean and unit Euclidean norm. We address the multi-class classification task using a one-vs-all strategy, as is often done in classification problems. Specifically, we learn a separate dictionary and a binary linear classifier by solving the optimization problem for each one-vs-all problem. Classification is then done by predicting with each binary classifier, and choosing the prediction with the highest score. In LAST, for each one-vs-all task, we naturally set 1/10 of the entries of s to 1 and the other entries to −1, assuming the distribution of features of the different classes in the dictionary should roughly be that of the images in the training set. In our proposed approach and SGD, we use dictionaries of size N = 200 for USPS and N = 400 for MNIST, as the latter dataset contains many more training samples. We compare LAST to the baseline classification techniques described in the previous section, as well as to sparse coding based methods.
In addition to building the dictionary in an unsupervised way, we consider the sparse coding classifiers of Mairal et al (2008), Huang and Aviyente (2006), and Ramirez et al (2010), which construct the dictionary in a supervised fashion. Classification results are shown in Table 3. One can see that LAST largely outperforms the linear and nearest neighbor classifiers. Moreover, our method has a slightly better accuracy than the RBF-SVM on MNIST, while being slightly worse on the USPS dataset. Our approach also outperforms the soft-thresholding based classifier optimized with stochastic gradient descent on both tasks, which highlights the benefits of our optimization technique compared to the standard algorithm used for training neural networks. We also report from Glorot et al (2011) the performance of a three-hidden-layer rectifier network optimized with stochastic gradient descent, without unsupervised pre-training. It can be seen that LAST, while having a much simpler architecture, slightly outperforms the deep rectifier network on the MNIST task. Furthermore, LAST outperforms the unsupervised sparse coding classifier on both datasets. Interestingly, the proposed scheme also competes with, and sometimes outperforms, the discriminative sparse coding techniques of Huang and Aviyente (2006), Mairal et al (2008), and Ramirez et al (2010), where the dictionary is tuned for classification. While providing comparable results, the LAST classifier is much faster at test time than sparse coding techniques and RBF-SVM classifiers. It is noteworthy that the best discriminative dictionary learning results we are aware of on these datasets are achieved by Mairal et al (2012), with an error rate of 0.54% on MNIST and 2.84% on USPS. Note however that in that work, the authors explicitly incorporate translation invariance in the problem by augmenting the training set with shifted versions of the digits.
Our focus here is instead on methods that do not augment the training set with distorted or transformed samples.

5.5 CIFAR-10 classification

We now consider the multi-class classification problem on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). The dataset contains 60000 color images of size 32 × 32 pixels, with 50000 images for training and 10000 for testing. The classifier input consists of vectors of raw pixel values of dimension 32 × 32 × 3 = 3072. This setting, similar to that of Glorot et al (2011), takes no advantage of the fact that we are dealing with images and is sometimes referred to as "permutation invariant", as columns in the data could be shuffled without affecting the result. We consider this scenario to focus on the comparison of the performance of the classifiers. Due to the relatively high dimensions of the problem (n = 3072, m = 50000), we limit ourselves to classifiers with feedforward architectures. In fact, using an RBF-SVM for this task would be prohibitively slow at the training and testing stages. For each one-vs-all task, we set the dictionary size of the LAST and SGD methods to 400.

                                        CIFAR-10
Linear SVM                              59.70
LAST (N = 400)                          46.56
SGD (N = 400)                           52.96
3-layer ReLU net                        50.86
3-layer ReLU net + sup. pre-training    49.96

Table 4: Classification error (percentage) on the CIFAR-10 dataset. ReLU net results are reported from (Glorot et al, 2011).

Moreover, unlike in the previous experiment, we set in LAST half of the entries of the sign vector s to 1 and the other half to −1. This is due to the high variability of intra-class images and the relatively small dictionary size: the number of atoms required to encode the positive class might not be sufficient if s is set according to the distribution of images in the training set. The results are reported in Table 4. Once again, this experiment confirms the superiority of our learning algorithm over linear SVM.
Moreover, LAST significantly outperforms the generic SGD training algorithm (by more than 6%) in this challenging classification example. What is more surprising is that LAST significantly surpasses the rectifier neural network with 3 hidden layers (Glorot et al, 2011) trained using a generic stochastic gradient descent algorithm (with or without pre-training). This shows that, despite the simplicity of our architecture (it can be seen as one hidden layer), adequate training of the classification scheme can give better performance than complicated structures that are potentially difficult to train. We finally report the results of the sparse coding classifier with a dictionary trained using Eq. (6). If we use a dictionary with 400 atoms, we get an error of 53.9%. By using a much larger dictionary of 4000 atoms, the error reduces to 46.5%. The computation of the test features is however computationally very expensive in that case.

6 Discussion

We first discuss in this section aspects related to the computational complexity of LAST. Then, we analyze the sparsity of the obtained solutions. We finally explain some of the differences between LAST and the generic stochastic gradient descent algorithm.

6.1 Computational complexity at test time

We compare the computational complexity and running times of the LAST classifier to those of different classification algorithms. Table 5 shows the computational complexity for classifying one test sample using various classifiers, and the time needed to classify the MNIST test images. We recall that n, m, and N denote respectively the signal dimension, the number of training samples and the dictionary size. Clearly, linear classification is very efficient, as it only requires the computation of one inner product between two vectors of dimension n.
Nonlinear SVMs however have a test complexity that is linear in the number of support vectors, which scales linearly with the training size (Burges, 1998). This solution is therefore not practical for relatively large training sets, like MNIST or CIFAR-10. Feature extraction with sparse coding involves solving an optimization problem, which roughly requires 1/√ε matrix-vector multiplications, where ε controls the precision (Beck and Teboulle, 2009). For a typical value of ε = 10⁻⁶, the complexity becomes 1000 nN (neglecting other constants), that is, 3 orders of magnitude larger than the complexity of the proposed method. This can be seen clearly in the computation times, as our approach is slightly more expensive than linear SVM, but remains much faster than the other methods. Note moreover that the soft-thresholding classification scheme is very simple to implement in practice at test time, as it is a direct map that only involves max and linear operations.

6.2 Sparsity

Sparsity is a highly beneficial property in representation learning, as it helps decompose the factors of variation in the data into high-level features (Bengio et al, 2013; Glorot et al, 2011). To assess the sparsity of the learned representation, we compute the average sparsity of our representation over all data points (training and testing combined) on the MNIST and CIFAR-10 datasets. We obtain an average of 96.7% zeros in the MNIST case, and 95.3% for CIFAR-10. In other words, our representations are very sparse, without adding an explicit sparsity penalization as in (Glorot et al, 2011). Interestingly, the reported average sparsity in (Glorot et al, 2011) is 83.4% on MNIST and 72.0% on CIFAR-10. Our one-layer representation therefore exhibits an interesting sparsity property, while providing good predictive performance.
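Since the test-time map only involves linear operations and a max, the full multi-class prediction (one-vs-all, as used in the digit and CIFAR-10 experiments) reduces to a few lines. This is a sketch: the function name and the `models` container for the learned (D, w) pairs are illustrative.

```python
import numpy as np

def predict_one_vs_all(x, models, alpha=1.0):
    """One-vs-all prediction for the soft-thresholding classifier (sketch).
    `models` is a list of (D, w) pairs, one per class. Each binary score is
    w^T max(0, D^T x - alpha), an O(nN) direct map; the class with the
    highest score wins."""
    scores = [w @ np.maximum(0.0, D.T @ x - alpha) for D, w in models]
    return int(np.argmax(scores))
```

Each per-class score costs one matrix-vector product plus an entrywise max, which is the O(nN) complexity reported in Table 5.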
                 Complexity   Time [s]
Linear SVM       O(n)         0.4
RBF kernel SVM   O(nm)        154
Sparse coding    O(nN/√ε)⁴    14⁵
LAST classifier  O(nN)        1.0

Table 5: Computational complexity for classifying one test sample, and time needed to predict the labels of the 10000 test samples in the MNIST dataset. For reference, all the experiments are carried out on a 2.6 GHz Intel Core i7 machine with 16 GB RAM.

6.3 LAST vs. stochastic gradient descent

As discussed earlier, the soft-thresholding classification scheme belongs to the more general family of neural network models. Neural networks are commonly optimized with stochastic gradient descent algorithms, as opposed to the DC method proposed in this paper. The proposed learning algorithm has several advantages compared to SGD:

• Better local minimum: In all our experiments, LAST reached a better solution than SGD in terms of testing accuracy. This confirms the observations of Tao and An (1998), whereby DCA converges to "good" local minima, and often to global minima in practice.

• Descent method: Unlike stochastic gradient descent, LAST (and more generally DCA) is a descent method. Moreover, it is guaranteed to converge to a critical point (Tao and An, 1998).

• No stepsize selection: Stochastic gradient descent (and more generally gradient descent based algorithms) are very sensitive to the difficult choice of the stepsize. Choosing a large stepsize in SGD can be beneficial, as it helps escape local minima, but it can also lead to an oscillatory behaviour that prevents convergence. Interestingly, our optimization algorithm does not involve any stepsize selection, when given a convex optimization solver. In fact, our algorithm solves a sequence of convex problems, which can be solved with any off-the-shelf convex solver.
Note that even if the intermediate convex optimization problems are solved with a gradient-descent based technique, the choice of the stepsize is less challenging, as we have a better understanding of the theoretical properties of stepsize rules in convex optimization problems. As we have previously mentioned, unlike SGD, our algorithm assumes the sign vector of the linear classifier w to be known. A simple heuristic choice of this parameter was however shown to provide very good results in the experiments, compared to SGD. Of course, choosing this parameter with cross-validation might lead to better results, but it also implies a slower training procedure.

7 Conclusion

We have proposed a supervised learning algorithm tailored for the soft-thresholding based classifier. The learning problem, which jointly estimates a discriminative dictionary D and a classifier hyperplane w, is cast as a DC problem and solved efficiently with an iterative algorithm. The proposed algorithm (LAST), which leverages the DC structure, significantly outperforms stochastic gradient descent in all our experiments. Furthermore, the resulting classifier consistently leads to better results than the unsupervised sparse coding classifier. Our method moreover compares favorably to other standard techniques such as linear, RBF kernel or nearest neighbor classifiers. The proposed LAST classifier has also been shown to compete with recent discriminative sparse coding techniques in handwritten digits classification experiments. We should mention that, while the sparse coding encoder features some form of competition between the different atoms in the dictionary (often referred to as explaining-away (Gregor and LeCun, 2010)), our encoder acts on the different atoms independently. Despite this simple behavior, our scheme is competitive when the dictionary and classifier parameters are learned in a suitable manner.
The classification scheme adopted in this paper can be seen as a one-hidden-layer neural network with a soft-thresholding activation function. This activation function has recently gained significant attention in the deep learning community, as it is believed to make the training procedure easier and less prone to bad local minima. Our work reveals an interesting structure of the optimization problem for the one-hidden-layer version of that network that allows good minima to be reached. An interesting question is whether it is possible to find a similar structure for networks with many hidden layers. This would help the training of deep networks, and offer insights on this challenging problem, which is usually tackled using stochastic gradient descent.

⁴The complexity reported here is that of the FISTA algorithm (Beck and Teboulle, 2009), where ε denotes the required precision. Note that another popular method for solving sparse coding is the homotopy method, which is efficient in practice; however, it has exponential theoretical complexity (Mairal and Yu, 2012).

⁵To provide a fair comparison with our method, we used dictionaries of the same size as for our proposed approach, for the sake of this experiment.

A Soft-thresholding as an approximation to non-negative sparse coding

We show here that soft-thresholding can be viewed as a coarse approximation to the non-negative sparse coding mapping (Denil and de Freitas, 2012). To see this, we consider the proximal gradient algorithm to solve the sparse coding problem with additional nonnegativity constraints on the coefficients. Specifically, we consider the following mapping:

argmin_{c ∈ R^N} ||x − Dc||_2^2 + λ ||c||_1  subject to c ≥ 0.
The proximal gradient algorithm proceeds by iterating the following recursive equation until convergence:

c^{k+1} = prox_{λt ||·||_1 + I_{·≥0}} (c^k + t D^T (x − D c^k)),

where prox is the proximal operator, t is the chosen stepsize, and I_{·≥0} is the indicator function, which is equal to 0 if all the components of the vector are nonnegative, and +∞ otherwise. Using the definition of the proximal mapping, we have

prox_{λt ||·||_1 + I_{·≥0}} (x) := argmin_{u ≥ 0} { (1/2) ||u − x||_2^2 + λt ||u||_1 } = max(0, x − λt).

Therefore, imposing the initial condition c^0 = 0 and a stepsize t = 1, the first step of the proximal gradient algorithm can be written as

c^1 = max(0, D^T x − λ) = h_λ(D^T x),

which precisely corresponds to our soft-thresholding map. In this way, our soft-thresholding map corresponds to an approximation of sparse coding, where only one iteration of the proximal gradient algorithm is performed.

B Proofs

B.1 Proof of Proposition 1

Before going through the proof of Proposition 1, we need the following results from (Horst, 2000, Section 4.2):

Proposition 3
1. Let {f_i}_{i=1}^l be DC functions. Then, for any set of real numbers (λ_1, ..., λ_l), Σ_{i=1}^l λ_i f_i is also DC.
2. Let f : R^n → R be DC and g : R → R be convex. Then, the composition g(f(x)) is DC.

We recall that the objective function of (P) is given by:

Σ_{i=1}^m L(y_i Σ_{j=1}^N s_j q(u_j^T x_i − v_j)) + (ν/2) ||v||_2^2.

The function ||v||_2^2 is convex and therefore DC. We show that the first part of the objective function is also DC. We rewrite this part as follows:

Σ_{i=1}^m L( Σ_{j : s_j = y_i} q(u_j^T x_i − v_j) − Σ_{j : s_j ≠ y_i} q(u_j^T x_i − v_j) ).

Since q is convex, q(u_j^T x_i − v_j) is also convex (Boyd and Vandenberghe, 2004). As the loss function L is convex, we finally conclude from Proposition 3 that the objective function is DC.
Moreover, since the constraint v ≥ ε is convex, we conclude that (P) is a DC optimization problem.

B.2 Proof of Proposition 2

We now suppose that L(x) = \max(0, 1 - x), and derive the DC form of the objective function. We have:

\sum_{i=1}^m L\Big( y_i \sum_{j=1}^N s_j \, q(u_j^T x_i - v_j) \Big)
= \sum_{i=1}^m \max\Big( 0, \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) - \sum_{j : s_j = y_i} q(u_j^T x_i - v_j) \Big)
= \sum_{i=1}^m \max\Big( \sum_{j : s_j = y_i} q(u_j^T x_i - v_j) - \sum_{j : s_j = y_i} q(u_j^T x_i - v_j), \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) - \sum_{j : s_j = y_i} q(u_j^T x_i - v_j) \Big)
= \sum_{i=1}^m \max\Big( \sum_{j : s_j = y_i} q(u_j^T x_i - v_j), \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) \Big) - \sum_{i=1}^m \sum_{j : s_j = y_i} q(u_j^T x_i - v_j).

The objective function of (P) can therefore be written as g - h, with:

g = \frac{\nu}{2} \| v \|_2^2 + \sum_{i=1}^m \max\Big( \sum_{j : s_j = y_i} q(u_j^T x_i - v_j), \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) \Big),
h = \sum_{i=1}^m \sum_{j : s_j = y_i} q(u_j^T x_i - v_j),

where g and h are convex functions.

Acknowledgments

The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments and references that helped to improve the quality of this paper.

References

Akata Z, Perronnin F, Harchaoui Z, Schmid C (2014) Good practice in large-scale learning for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence pp 507–520
An LTH, Tao PD (2005) The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research 133(1-4):23–46
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202
Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives.
IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828
Bishop CM (1995) Neural Networks for Pattern Recognition. Oxford University Press, Inc.
Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press
Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2):121–167
Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27
Chen CF, Wei CP, Wang YC (2012) Low-rank matrix recovery with structural incoherence for robust face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2618–2625
Coates A, Ng A (2011) The importance of encoding versus training with sparse coding and vector quantization. In: International Conference on Machine Learning (ICML), pp 921–928
Denil M, de Freitas N (2012) Recklessly approximate sparse coding. arXiv preprint arXiv:1208.0959
Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing 15(12):3736–3745
Fadili J, Starck JL, Murtagh F (2009) Inpainting and zooming using sparse representations. The Computer Journal 52(1):64–79
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier networks. In: International Conference on Artificial Intelligence and Statistics (AISTATS), vol 15, pp 315–323
Gregor K, LeCun Y (2010) Learning fast approximations of sparse coding. In: International Conference on Machine Learning (ICML), pp 399–406
Horst R (2000) Introduction to Global Optimization. Springer
Huang K, Aviyente S (2006) Sparse representation for signal classification.
In: Advances in Neural Information Processing Systems, pp 609–616
Hull JJ (1994) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5):550–554
Kavukcuoglu K, Ranzato M, LeCun Y (2010a) Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467
Kavukcuoglu K, Sermanet P, Boureau YL, Gregor K, Mathieu M, LeCun Y (2010b) Learning convolutional feature hierarchies for visual recognition. In: Advances in Neural Information Processing Systems (NIPS), pp 1090–1098
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto
Larochelle H, Mandel M, Pascanu R, Bengio Y (2012) Learning algorithms for the classification restricted Boltzmann machine. The Journal of Machine Learning Research 13:643–669
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324
Ma L, Wang C, Xiao B, Zhou W (2012) Sparse representation for face recognition based on discriminative low-rank dictionary learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2586–2593
Maas A, Hannun A, Ng A (2013) Rectifier nonlinearities improve neural network acoustic models. In: International Conference on Machine Learning (ICML)
Mairal J, Yu B (2012) Complexity analysis of the lasso regularization path. In: International Conference on Machine Learning (ICML), pp 353–360
Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. In: Advances in Neural Information Processing Systems (NIPS), pp 1033–1040
Mairal J, Bach F, Ponce J, Sapiro G (2010) Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research 11:19–60
Mairal J, Bach F, Ponce J (2012) Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4):791–804
Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: International Conference on Machine Learning (ICML), pp 759–766
Ramirez I, Sprechmann P, Sapiro G (2010) Classification and clustering via dictionary learning with structured incoherence and shared features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3501–3508
Shawe-Taylor J, Cristianini N (2004) Kernel Methods for Pattern Analysis. Cambridge University Press
Sriperumbudur BK, Torres DA, Lanckriet GR (2007) Sparse eigen methods by DC programming. In: International Conference on Machine Learning (ICML), pp 831–838
Tao PD, An LTH (1998) A DC optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization 8(2):476–505
Valkealahti K, Oja E (1998) Reduced multidimensional co-occurrence histograms in texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1):90–94
Figueras i Ventura R, Vandergheynst P, Frossard P (2006) Low-rate and flexible image coding with redundant representations. IEEE Transactions on Image Processing 15(3):726–739
Wright J, Yang A, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2):210–227
Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1794–1801
Yuille A, Rangarajan A (2002) The concave-convex procedure (CCCP).
In: Advances in Neural Information Processing Systems (NIPS), vol 2, pp 1033–1040
Zeiler M, Ranzato M, Monga R, Mao M, Yang K, Le Q, Nguyen P, Senior A, Vanhoucke V, Dean J, Hinton G (2013) On rectified linear units for speech processing. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3517–3521
Zhang Y, Jiang Z, Davis L (2013) Learning structured low-rank representations for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 676–683
